Thread: Assertion failure twophase.c (2) (testing HS/SR)

Assertion failure twophase.c (2) (testing HS/SR)

From
"Erik Rijkers"
Date:
in a 9.0devel, primary+standby, cvs from 2010.03.04 01:30

With three patches:
 new_smart_shutdown_20100201.patch extend_format_of_recovery_info_funcs_v4.20100303.patch
fix-KnownAssignedXidsRemoveMany-1.patch
 pg_dump -d $db8.4.2 | psql -d $db9.0devel-primary

FailedAssertion, File: "twophase.c", Line: 1201.

The standby was restarted and seems to catch up OK again.


LOG:  database system was interrupted; last known up at 2010-03-04 01:35:23 CET
cp: cannot stat `/var/data1/pg_stuff/dump/hotslave/replication_archive/000000010000000000000001':
No such file or directory
FATAL:  the database system is starting up
LOG:  entering standby mode
LOG:  redo starts at 0/1000020
LOG:  consistent recovery state reached at 0/2000000
LOG:  database system is ready to accept read only connections
ERROR:  cannot execute CREATE TABLE in a read-only transaction
STATEMENT:  create table t (c text);
ERROR:  cannot execute SELECT INTO in a read-only transaction
STATEMENT:  create table t as select 1;
ERROR:  cannot execute TRUNCATE TABLE in a read-only transaction
STATEMENT:  truncate wal;
ERROR:  cannot execute TRUNCATE TABLE in a read-only transaction
STATEMENT:  truncate wal;
ERROR:  cannot execute TRUNCATE TABLE in a read-only transaction
STATEMENT:  truncate wal;
ERROR:  cannot execute TRUNCATE TABLE in a read-only transaction
STATEMENT:  truncate wal;
ERROR:  cannot execute TRUNCATE TABLE in a read-only transaction
STATEMENT:  truncate wal;
ERROR:  cannot execute TRUNCATE TABLE in a read-only transaction
STATEMENT:  truncate wal;
ERROR:  cannot execute TRUNCATE TABLE in a read-only transaction
STATEMENT:  truncate wal;
FATAL:  database "ms" does not exist
TRAP: FailedAssertion("!(((xid) != ((TransactionId) 0)))", File: "twophase.c", Line: 1201)
LOG:  startup process (PID 18107) was terminated by signal 6: Aborted
LOG:  terminating any other active server processes
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and
exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
LOG:  database system was interrupted while in recovery at log time 2010-03-04 05:00:24 CET
HINT:  If this has occurred more than once some data might be corrupted and you might need to
choose an earlier recovery target.
cp: cannot stat `/var/data1/pg_stuff/dump/hotslave/replication_archive/000000010000001C0000007F':
No such file or directory
LOG:  entering standby mode
LOG:  redo starts at 1C/7800F8E0
LOG:  consistent recovery state reached at 1C/ADB9C178
LOG:  database system is ready to accept read only connections

The ERRORs (and FATAL) were accidentally issued commands; I can't tell if they were causing the
assertion. (database 'ms' indeed was not present on this instance)

see also:
http://archives.postgresql.org/pgsql-hackers/2010-02/msg02221.php


thanks,

Erik Rijkers




Re: Assertion failure twophase.c (2) (testing HS/SR)

From
Heikki Linnakangas
Date:
Erik Rijkers wrote:
> in a 9.0devel, primary+standby, cvs from 2010.03.04 01:30
> 
> With three patches:
> 
>   new_smart_shutdown_20100201.patch
>   extend_format_of_recovery_info_funcs_v4.20100303.patch

Got a link to these two patches? I couldn't find them with a quick search.

>   fix-KnownAssignedXidsRemoveMany-1.patch
> 
>   pg_dump -d $db8.4.2 | psql -d $db9.0devel-primary
> 
> FailedAssertion, File: "twophase.c", Line: 1201.
> 
> The standby was restarted and seems to catch up OK again.
> ...
> see also:
> http://archives.postgresql.org/pgsql-hackers/2010-02/msg02221.php

I'm still not any wiser on what's causing that, but I've fixed the bug
in KnownAssignedXidsMany() now.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Assertion failure twophase.c (2) (testing HS/SR)

From
Fujii Masao
Date:
On Thu, Mar 11, 2010 at 6:29 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> Erik Rijkers wrote:
>> in a 9.0devel, primary+standby, cvs from 2010.03.04 01:30
>>
>> With three patches:
>>
>>   new_smart_shutdown_20100201.patch

http://archives.postgresql.org/pgsql-hackers/2010-01/msg03116.php

>>   extend_format_of_recovery_info_funcs_v4.20100303.patch

http://archives.postgresql.org/pgsql-hackers/2010-03/msg00175.php

> Got a link to these two patches? I couldn't find them with a quick search.

For your convenience, I attached those patches in this post.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment

Re: Assertion failure twophase.c (2) (testing HS/SR)

From
Simon Riggs
Date:
On Thu, 2010-03-11 at 11:29 +0200, Heikki Linnakangas wrote:

> I'm still not any wiser on what's causing that, but I've fixed the bug
> in KnownAssignedXidsMany() now.

Yeh, I've been mulling this over for a few days now and I can't see a
way that could have happened either.

I agree with your fix and the stronger placement of the Assertion.
Thanks.

I will be doing some further investigation in that area as well, over
next week or so.

-- Simon Riggs           www.2ndQuadrant.com



Re: Assertion failure twophase.c (3) (testing HS/SR)

From
"Erik Rijkers"
Date:
On Thu, March 4, 2010 17:00, Erik Rijkers wrote:
> in a 9.0devel, primary+standby, cvs from 2010.03.04 01:30
>
> With three patches:
>
>   new_smart_shutdown_20100201.patch
>   extend_format_of_recovery_info_funcs_v4.20100303.patch
>   fix-KnownAssignedXidsRemoveMany-1.patch
>
>   pg_dump -d $db8.4.2 | psql -d $db9.0devel-primary
>
> FailedAssertion, File: "twophase.c", Line: 1201.
>

For the record, this still happens (FailedAssertion, File: "twophase.c", Line: 1201.)
(created 2010.03.13 23:49 cvs).

Unfortunately, it does not happen always, or predictably.

patches:new_smart_shutdown_20100201.patchextend_format_of_recovery_info_funcs_v4.20100303.patch(both here:
http://archives.postgresql.org/pgsql-hackers/2010-03/msg00446.php)
 
 (fix-KnownAssignedXidsRemoveMany-1.patch has been committed, I think?)


I use commandlines like this to copy schemas across from 8.4.2 to 9.0devel:
pg_dump -c -h /tmp -p 5432 -n myschema --no-owner --no-privileges mydb \ | psql -1qtA -h /tmp -p 7575 -d replicas

(the copied schemas were together 175 GB)

As I seem to be the only one who finds this, I started looking what could be unique in this
install: and it would be postbio, which we use for its gist-indexing on ranges
(http://pgfoundry.org/projects/postbio/).  We use postbio's int_interval type as a column type. 
But keep in mind that sometimes the whole dump+restore+replication completes OK.


Other installed modules are: contrib/btree_gist contrib/seg contrib/adminpack

log_line_prefix = '%t %p %d %u start=%s ' # slave

pgsql.sr_hotslave/logfile:

2010-03-13 23:54:59 CET 15765   start=2010-03-13 23:54:59 CET LOG:  database system was
interrupted; last known up at 2010-03-13 23:54:31 CET
cp: cannot stat `/var/data1/pg_stuff/dump/hotslave/replication_archive/000000010000000000000001':
No such file or directory
2010-03-13 23:55:00 CET 15765   start=2010-03-13 23:54:59 CET LOG:  entering standby mode
2010-03-13 23:55:00 CET 15765   start=2010-03-13 23:54:59 CET LOG:  redo starts at 0/1000020
2010-03-13 23:55:00 CET 15765   start=2010-03-13 23:54:59 CET LOG:  consistent recovery state
reached at 0/2000000
2010-03-13 23:55:00 CET 15763   start=2010-03-13 23:54:59 CET LOG:  database system is ready to
accept read only connections
TRAP: FailedAssertion("!(((xid) != ((TransactionId) 0)))", File: "twophase.c", Line: 1201)
2010-03-14 05:28:59 CET 15763   start=2010-03-13 23:54:59 CET LOG:  startup process (PID 15765)
was terminated by signal 6: Aborted
2010-03-14 05:28:59 CET 15763   start=2010-03-13 23:54:59 CET LOG:  terminating any other active
server processes


Maybe I'll try now to setup a similar instance without postbio, to see if the crash still occurs.

hth,

Erik Rijkers




Re: Assertion failure twophase.c (3) (testing HS/SR)

From
Heikki Linnakangas
Date:
Can you still reproduce this or has some of the changes since then fixed
it? We never quite figured out the cause..

Erik Rijkers wrote:
> On Thu, March 4, 2010 17:00, Erik Rijkers wrote:
>> in a 9.0devel, primary+standby, cvs from 2010.03.04 01:30
>>
>> With three patches:
>>
>>   new_smart_shutdown_20100201.patch
>>   extend_format_of_recovery_info_funcs_v4.20100303.patch
>>   fix-KnownAssignedXidsRemoveMany-1.patch
>>
>>   pg_dump -d $db8.4.2 | psql -d $db9.0devel-primary
>>
>> FailedAssertion, File: "twophase.c", Line: 1201.
>>
> 
> For the record, this still happens (FailedAssertion, File: "twophase.c", Line: 1201.)
> (created 2010.03.13 23:49 cvs).
> 
> Unfortunately, it does not happen always, or predictably.
> 
> patches:
>  new_smart_shutdown_20100201.patch
>  extend_format_of_recovery_info_funcs_v4.20100303.patch
>  (both here: http://archives.postgresql.org/pgsql-hackers/2010-03/msg00446.php )
> 
>   (fix-KnownAssignedXidsRemoveMany-1.patch has been committed, I think?)
> 
> 
> I use commandlines like this to copy schemas across from 8.4.2 to 9.0devel:
> pg_dump -c -h /tmp -p 5432 -n myschema --no-owner --no-privileges mydb \
>   | psql -1qtA -h /tmp -p 7575 -d replicas
> 
> (the copied schemas were together 175 GB)
> 
> As I seem to be the only one who finds this, I started looking what could be unique in this
> install: and it would be postbio, which we use for its gist-indexing on ranges
> (http://pgfoundry.org/projects/postbio/).  We use postbio's int_interval type as a column type. 
> But keep in mind that sometimes the whole dump+restore+replication completes OK.
> 
> 
> Other installed modules are:
>   contrib/btree_gist
>   contrib/seg
>   contrib/adminpack
> 
> log_line_prefix = '%t %p %d %u start=%s ' # slave
> 
> pgsql.sr_hotslave/logfile:
> 
> 2010-03-13 23:54:59 CET 15765   start=2010-03-13 23:54:59 CET LOG:  database system was
> interrupted; last known up at 2010-03-13 23:54:31 CET
> cp: cannot stat `/var/data1/pg_stuff/dump/hotslave/replication_archive/000000010000000000000001':
> No such file or directory
> 2010-03-13 23:55:00 CET 15765   start=2010-03-13 23:54:59 CET LOG:  entering standby mode
> 2010-03-13 23:55:00 CET 15765   start=2010-03-13 23:54:59 CET LOG:  redo starts at 0/1000020
> 2010-03-13 23:55:00 CET 15765   start=2010-03-13 23:54:59 CET LOG:  consistent recovery state
> reached at 0/2000000
> 2010-03-13 23:55:00 CET 15763   start=2010-03-13 23:54:59 CET LOG:  database system is ready to
> accept read only connections
> TRAP: FailedAssertion("!(((xid) != ((TransactionId) 0)))", File: "twophase.c", Line: 1201)
> 2010-03-14 05:28:59 CET 15763   start=2010-03-13 23:54:59 CET LOG:  startup process (PID 15765)
> was terminated by signal 6: Aborted
> 2010-03-14 05:28:59 CET 15763   start=2010-03-13 23:54:59 CET LOG:  terminating any other active
> server processes
> 
> 
> Maybe I'll try now to setup a similar instance without postbio, to see if the crash still occurs.
> 
> hth,
> 
> Erik Rijkers
> 
> 
> 


--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Assertion failure twophase.c (3) (testing HS/SR)

From
"Erik Rijkers"
Date:
On Thu, April 22, 2010 09:53, Heikki Linnakangas wrote:
> Can you still reproduce this or has some of the changes since then fixed
> it? We never quite figured out the cause..

I don't know for sure:
>> Unfortunately, it does not happen always, or predictably.

The only thing that I established after that email was sent,
is that the error can also occur without the postbio package
being been installed (this has happened once).

It's a very easy test; I will probably run it a few more times.


> Erik Rijkers wrote:
>> On Thu, March 4, 2010 17:00, Erik Rijkers wrote:
>>> in a 9.0devel, primary+standby, cvs from 2010.03.04 01:30
>>>
>>> With three patches:
>>>
>>>   new_smart_shutdown_20100201.patch
>>>   extend_format_of_recovery_info_funcs_v4.20100303.patch
>>>   fix-KnownAssignedXidsRemoveMany-1.patch
>>>
>>>   pg_dump -d $db8.4.2 | psql -d $db9.0devel-primary
>>>
>>> FailedAssertion, File: "twophase.c", Line: 1201.
>>>
>>
>> For the record, this still happens (FailedAssertion, File: "twophase.c", Line: 1201.)
>> (created 2010.03.13 23:49 cvs).
>>
>> Unfortunately, it does not happen always, or predictably.
>>
>> patches:
>>  new_smart_shutdown_20100201.patch
>>  extend_format_of_recovery_info_funcs_v4.20100303.patch
>>  (both here: http://archives.postgresql.org/pgsql-hackers/2010-03/msg00446.php )
>>
>>   (fix-KnownAssignedXidsRemoveMany-1.patch has been committed, I think?)
>>
>>
>> I use commandlines like this to copy schemas across from 8.4.2 to 9.0devel:
>> pg_dump -c -h /tmp -p 5432 -n myschema --no-owner --no-privileges mydb \
>>   | psql -1qtA -h /tmp -p 7575 -d replicas
>>
>> (the copied schemas were together 175 GB)
>>
>> As I seem to be the only one who finds this, I started looking what could be unique in this
>> install: and it would be postbio, which we use for its gist-indexing on ranges
>> (http://pgfoundry.org/projects/postbio/).  We use postbio's int_interval type as a column type.
>> But keep in mind that sometimes the whole dump+restore+replication completes OK.
>>
>>
>> Other installed modules are:
>>   contrib/btree_gist
>>   contrib/seg
>>   contrib/adminpack
>>
>> log_line_prefix = '%t %p %d %u start=%s ' # slave
>>
>> pgsql.sr_hotslave/logfile:
>>
>> 2010-03-13 23:54:59 CET 15765   start=2010-03-13 23:54:59 CET LOG:  database system was
>> interrupted; last known up at 2010-03-13 23:54:31 CET
>> cp: cannot stat
>> `/var/data1/pg_stuff/dump/hotslave/replication_archive/000000010000000000000001':
>> No such file or directory
>> 2010-03-13 23:55:00 CET 15765   start=2010-03-13 23:54:59 CET LOG:  entering standby mode
>> 2010-03-13 23:55:00 CET 15765   start=2010-03-13 23:54:59 CET LOG:  redo starts at 0/1000020
>> 2010-03-13 23:55:00 CET 15765   start=2010-03-13 23:54:59 CET LOG:  consistent recovery state
>> reached at 0/2000000
>> 2010-03-13 23:55:00 CET 15763   start=2010-03-13 23:54:59 CET LOG:  database system is ready to
>> accept read only connections
>> TRAP: FailedAssertion("!(((xid) != ((TransactionId) 0)))", File: "twophase.c", Line: 1201)
>> 2010-03-14 05:28:59 CET 15763   start=2010-03-13 23:54:59 CET LOG:  startup process (PID 15765)
>> was terminated by signal 6: Aborted
>> 2010-03-14 05:28:59 CET 15763   start=2010-03-13 23:54:59 CET LOG:  terminating any other active
>> server processes
>>
>>
>> Maybe I'll try now to setup a similar instance without postbio, to see if the crash still
>> occurs.
>>
>> hth,
>>
>> Erik Rijkers
>>
>>
>>
>
>
> --
>   Heikki Linnakangas
>   EnterpriseDB   http://www.enterprisedb.com
>




Re: Assertion failure twophase.c (3) (testing HS/SR)

From
Simon Riggs
Date:
On Fri, 2010-04-23 at 03:08 +0200, Erik Rijkers wrote:

> It's a very easy test; I will probably run it a few more times.

Please share details of your system and hardware.

-- Simon Riggs           www.2ndQuadrant.com