Thread: Replication terminated due to PANIC

Replication terminated due to PANIC

From

Adarsh Sharma

Date:

25 April 2013, 03:05:23

Hi all,

I have a PostgreSQL 9.2 instance running on a CentOS 6.3 box. Yesterday I set up a hot standby using pg_basebackup. Today I got the alert below from the standby box:

[1] (from line 412,723)
2013-04-24 23:07:18 UTC [13445]: [6-1] user= db= host= PANIC: _bt_restore_page: cannot add item to page

When I checked, replication had terminated because the standby database had shut down. In the logs I can see the messages below:

2013-04-24 23:17:16 UTC [26989]: [5360083-1] user= db= host= ERROR: could not open file "global/14078": No such file or directory
2013-04-24 23:17:16 UTC [26989]: [5360084-1] user= db= host= CONTEXT: writing block 0 of relation global/14078
2013-04-24 23:17:16 UTC [26989]: [5360085-1] user= db= host= WARNING: could not write block 0 of global/14078
2013-04-24 23:17:16 UTC [26989]: [5360086-1] user= db= host= DETAIL: Multiple failures --- write error might be permanent.
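
One way to read excerpts like these is to group the entries by the process ID in the log prefix, since the PANIC and the file errors need not come from the same process. The following is a minimal Python sketch, assuming the prefix format shown in the lines above; the names `LOG_LINE`, `group_by_pid`, and `SAMPLE` are illustrative and not part of PostgreSQL.

```python
import re
from collections import defaultdict

# Matches the prefix seen in the excerpts above, for example:
# "2013-04-24 23:07:18 UTC [13445]: [6-1] user= db= host= PANIC: message"
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<tz>\S+) "
    r"\[(?P<pid>\d+)\]: \[(?P<seq>\d+)-\d+\] "
    r"user=\S* db=\S* host=\S* "
    r"(?P<level>[A-Z]+):\s+(?P<msg>.*)$"
)


def group_by_pid(lines):
    """Return {pid: [(level, message), ...]} for lines matching LOG_LINE."""
    grouped = defaultdict(list)
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match:
            grouped[int(match.group("pid"))].append(
                (match.group("level"), match.group("msg"))
            )
    return dict(grouped)


# Three of the lines quoted in this message.
SAMPLE = [
    '2013-04-24 23:07:18 UTC [13445]: [6-1] user= db= host= PANIC: '
    '_bt_restore_page: cannot add item to page',
    '2013-04-24 23:17:16 UTC [26989]: [5360083-1] user= db= host= ERROR: '
    'could not open file "global/14078": No such file or directory',
    '2013-04-24 23:17:16 UTC [26989]: [5360086-1] user= db= host= DETAIL: '
    'Multiple failures --- write error might be permanent.',
]

if __name__ == "__main__":
    for pid, entries in sorted(group_by_pid(SAMPLE).items()):
        print(pid, [level for level, _ in entries])
```

On the sample, the PANIC is reported by PID 13445 while the "could not open file" entries are reported by PID 26989, so the two symptoms may be worth investigating separately.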

I checked the global directory on the master; 14078 does not exist there.

Has anyone faced this issue?

Thanks

Re: Replication terminated due to PANIC

From

Sergey Konoplev

Date:

25 April 2013, 05:44:54

On Wed, Apr 24, 2013 at 5:05 PM, Adarsh Sharma <eddy.adarsh@gmail.com> wrote:
> I have a Postgresql 9.2 instance running on a CentOS6.3 box.Yesterday i
> setup a hot standby by using pgbasebackup. Today i got the below  alert from
> standby box :
>
> [1] (from line 412,723)
> 2013-04-24 23:07:18 UTC [13445]: [6-1] user= db= host= PANIC:
> _bt_restore_page: cannot add item to page
>
> When i check, the replication is terminated due to slave DB shutdown. From
> the logs i can see below messages :-

I am not sure that it is your situation but take a look at this thread:

http://www.postgresql.org/message-id/CAL_0b1t=WuM6roO8dki=w8DhH8P8whhohbPjReymmQUrOcNT2A@mail.gmail.com

There is a patch by Andres Freund in the end of the discussion. Three
weeks have passed after I installed the patched version and it looks
like the patch fixed my issue.

>
> 2013-04-24 23:17:16 UTC [26989]: [5360083-1] user= db= host= ERROR:  could
> not open file "global/14078": No such file or directory
> 2013-04-24 23:17:16 UTC [26989]: [5360084-1] user= db= host= CONTEXT:
> writing block 0 of relation global/14078
> 2013-04-24 23:17:16 UTC [26989]: [5360085-1] user= db= host= WARNING:  could
> not write block 0 of global/14078
> 2013-04-24 23:17:16 UTC [26989]: [5360086-1] user= db= host= DETAIL:
> Multiple failures --- write error might be permanent.
>
> I checked in global directory of master, the directory 14078 doesn't exist.
>
> Anyone has faced above issue ?
>
> Thanks



--
Kind regards,
Sergey Konoplev
Database and Software Consultant

Profile: http://www.linkedin.com/in/grayhemp
Phone: USA +1 (415) 867-9984, Russia +7 (901) 903-0499, +7 (988) 888-1979
Skype: gray-hemp
Jabber: gray.ru@gmail.com

Re: Replication terminated due to PANIC

From

Adarsh Sharma

Date:

25 April 2013, 08:46:45

Thanks, Sergey, for such a quick response, but I don't think this is a patch problem, because we have other DB servers running fine on the same version, and the message is also different:

host= PANIC: _bt_restore_page: cannot add item to page

Replication works fine the whole day, but at midnight, when the log rotates, it shows the messages below:

2013-04-24 00:00:00 UTC [26989]: [4945032-1] user= db= host= LOG:  checkpoint starting: time
2013-04-24 00:00:00 UTC [26989]: [4945033-1] user= db= host= ERROR:  could not open file "global/14078": No such file or directory
2013-04-24 00:00:00 UTC [26989]: [4945034-1] user= db= host= CONTEXT:  writing block 0 of relation global/14078
2013-04-24 00:00:00 UTC [26989]: [4945035-1] user= db= host= WARNING:  could not write block 0 of global/14078
2013-04-24 00:00:00 UTC [26989]: [4945036-1] user= db= host= DETAIL:  Multiple failures --- write error might be permanent.

Looks like some index corruption.

Thanks


Re: Replication terminated due to PANIC

From

Lonni J Friedman

Date:

25 April 2013, 16:57:36

If it's really index corruption, then you should be able to fix it by
reindexing. However, that doesn't explain what caused the corruption.
Perhaps your hardware is bad in some way?

On Wed, Apr 24, 2013 at 10:46 PM, Adarsh Sharma <eddy.adarsh@gmail.com> wrote:
> Thanks Sergey for such a quick response, but i dont think this is some patch
> problem because we have other DB servers also running fine on same version
> and message is also different :
>
> host= PANIC: _bt_restore_page: cannot add item to page
>
> And the whole day replication is working fine but at midnight when log
> rotates it shows belows msg :
>
> 2013-04-24 00:00:00 UTC [26989]: [4945032-1] user= db= host= LOG:
> checkpoint starting: time
> 2013-04-24 00:00:00 UTC [26989]: [4945033-1] user= db= host= ERROR:  could
> not open file "global/14078": No such file or directory
>
> 2013-04-24 00:00:00 UTC [26989]: [4945034-1] user= db= host= CONTEXT:
> writing block 0 of relation global/14078
> 2013-04-24 00:00:00 UTC [26989]: [4945035-1] user= db= host= WARNING:  could
> not write block 0 of global/14078
>
> 2013-04-24 00:00:00 UTC [26989]: [4945036-1] user= db= host= DETAIL:
> Multiple failures --- write error might be permanent.
>
> Looks like some index corruption.
>
>
> Thanks



--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
L. Friedman                                    netllama@gmail.com
LlamaLand                       https://netllama.linux-sxs.org

Re: Replication terminated due to PANIC

From

Andres Freund

Date:

25 April 2013, 17:16:38

On 2013-04-24 19:44:25 -0700, Sergey Konoplev wrote:
> On Wed, Apr 24, 2013 at 5:05 PM, Adarsh Sharma <eddy.adarsh@gmail.com> wrote:
> > I have a Postgresql 9.2 instance running on a CentOS6.3 box.Yesterday i
> > setup a hot standby by using pgbasebackup. Today i got the below  alert from
> > standby box :
> >
> > [1] (from line 412,723)
> > 2013-04-24 23:07:18 UTC [13445]: [6-1] user= db= host= PANIC:
> > _bt_restore_page: cannot add item to page
> >
> > When i check, the replication is terminated due to slave DB shutdown. From
> > the logs i can see below messages :-

Does the global/14078 file exist on the primary? What exact commandline
were you using to restore? Which exact version of postgres?
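
A check like the first question can be scripted so that it gives the same answer on the primary and on each standby. The following is a minimal Python sketch, assuming direct filesystem access to the data directory; the function name `relation_segments` and the example data directory path are hypothetical.

```python
import glob
import os


def relation_segments(data_dir, rel_path):
    """List on-disk files for a relation path such as 'global/14078'.

    Returns the base file (if present) followed by numbered segment
    files such as 14078.1. Fork files such as 14078_fsm or 14078_vm
    are not included.
    """
    base = os.path.join(data_dir, rel_path)
    found = [base] if os.path.isfile(base) else []
    for candidate in sorted(glob.glob(glob.escape(base) + ".*")):
        suffix = candidate[len(base) + 1:]
        # Keep only purely numeric suffixes, i.e. segment files.
        if suffix.isdigit() and os.path.isfile(candidate):
            found.append(candidate)
    return found


if __name__ == "__main__":
    # Hypothetical data directory; substitute the real one on each host.
    print(relation_segments("/var/lib/pgsql/9.2/data", "global/14078"))
```

An empty list means no file for that relation path exists under the given data directory. Running the same check against each data directory on the standby box would show which instance, if any, actually has the file.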

> I am not sure that it is your situation but take a look at this thread:
>
> http://www.postgresql.org/message-id/CAL_0b1t=WuM6roO8dki=w8DhH8P8whhohbPjReymmQUrOcNT2A@mail.gmail.com
>
> There is a patch by Andres Freund in the end of the discussion.

The issues don't look related.

> Three
> weeks have passed after I installed the patched version and it looks
> like the patch fixed my issue.

Oh, cool! Thanks for verifying.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Replication terminated due to PANIC

From

Adarsh Sharma

Date:

26 April 2013, 07:22:13

Sorry, my bad, I didn't mention the full DB version:

9.2.4.8 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-52), 64-bit

Apart from this, I am happy to report that the issue is fixed now. There are two standby setups on the standby box, on different ports, and there were two stale processes (logger and writer) running with different parent IDs on the box. After killing those processes and reloading the configuration file, the DB server is replaying logs properly.
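
Stale helper processes of this kind can be spotted by comparing each helper's parent PID with the PIDs of the postmasters that are actually running. The following is a minimal Python sketch that works on the text output of `ps -eo pid,ppid,args`. The sample listing, its paths, and its PIDs are hypothetical and not taken from the affected box, and the sketch assumes that helper processes show a title beginning with "postgres:" while the postmaster keeps its original command line.

```python
import os

# Hypothetical `ps -eo pid,ppid,args` output for illustration only.
SAMPLE_PS = """\
  PID  PPID COMMAND
 2101     1 /usr/pgsql-9.2/bin/postmaster -D /data/standby1
 2110  2101 postgres: logger process
 2112  2101 postgres: startup process
 3050     1 postgres: logger process
 3051     1 postgres: writer process
"""


def find_orphaned_helpers(ps_output):
    """Return (pid, ppid, title) for postgres helper processes whose
    parent is not a postmaster present in the same listing."""
    rows = []
    for line in ps_output.strip().splitlines():
        parts = line.rstrip().split(None, 2)
        if len(parts) < 3 or not parts[0].isdigit() or not parts[1].isdigit():
            continue  # skip the header and malformed lines
        rows.append((int(parts[0]), int(parts[1]), parts[2]))

    # Assumption: the postmaster's executable is named "postgres" or
    # "postmaster"; helpers have a first word of "postgres:" instead.
    postmasters = {
        pid for pid, _, args in rows
        if os.path.basename(args.split()[0]) in {"postgres", "postmaster"}
    }
    return [
        (pid, ppid, args) for pid, ppid, args in rows
        if args.startswith("postgres:") and ppid not in postmasters
    ]


if __name__ == "__main__":
    for pid, ppid, title in find_orphaned_helpers(SAMPLE_PS):
        print(pid, ppid, title)
```

In the hypothetical listing, the logger and writer with PIDs 3050 and 3051 are reported because their parent is not a running postmaster. Anything flagged this way should be confirmed by hand before it is killed.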

@Andres: No, the directory doesn't exist on the master, but it exists on the other standby.

@Lonni, I was guessing because of the message below in the logs:

_bt_restore_page: cannot add item to page

http://en.verysource.com/code/5191515_1/nbtxlog.c.html
Yes, we faced hardware issues on the master, so we flipped to the slave and set up a new SR, in which we are facing this issue.

I still don't know why this PANIC message appeared. Anyway, thank you all for giving your valuable time to this.

Thanks
