Thread: BUG #3362: xlog corruption just after initdb on irix

BUG #3362: xlog corruption just after initdb on irix

From
"Herve Boulouis"
Date:
The following bug has been logged online:

Bug reference:      3362
Logged by:          Herve Boulouis
Email address:      amon@sockar.homeip.net
PostgreSQL version: 8.2.4
Operating system:   IRIX 6.5.26
Description:        xlog corruption just after initdb on irix
Details:

what I did:

- compile postgres 8.2.4 (MIPSPro 7.4.4m compiler)

export CC=c99
export CXX=CC
./configure --enable-integer-datetimes --with-perl --with-python

- compile and install, create pgsql user, ...

- initialize db

su - pgsql
export PGLIB=/usr/local/pgsql/lib
export PGDATA=$HOME/data
initdb

- trigger the bug

/etc/init.d/pgsql start
/etc/init.d/pgsql stop
/etc/init.d/pgsql start
/etc/init.d/pgsql stop
/etc/init.d/pgsql start
<signal 6 on postmaster startup, see logfile>

from that point, postmaster won't start unless another initdb is done.

- command used by the script to start/stop

su - pgsql -c "exec /usr/local/pgsql/bin/pg_ctl start -w -D
'/usr/local/pgsql/data'"
su - pgsql -c "/usr/local/pgsql/bin/pg_ctl stop -D '/usr/local/pgsql/data'
-s -m smart"

- logfile of the 5 start/stop operations :

* first start
Jun  5 15:22:59 6T:origin300 postgres[2007]: [1-1] LOG:  could not create
IPv6 socket: Address family not supported by protocol family
Jun  5 15:22:59 6T:origin300 postgres[2007]: [2-1] LOG:  could not create
socket for statistics collector: Address family not supported by protocol
family
Jun  5 15:22:59 6T:origin300 postgres[2007]: [3-1] LOG:  trying another
address for the statistics collector
Jun  5 15:22:59 6T:origin300 postgres[2008]: [4-1] LOG:  database system was
shut down at 2007-06-05 15:21:59 GMT
Jun  5 15:22:59 6T:origin300 postgres[2008]: [5-1] LOG:  checkpoint record
is at 0/4866C8
Jun  5 15:22:59 6T:origin300 postgres[2008]: [6-1] LOG:  redo record is at
0/4866C8; undo record is at 0/0; shutdown TRUE
Jun  5 15:22:59 6T:origin300 postgres[2008]: [7-1] LOG:  next transaction
ID: 0/593; next OID: 10820
Jun  5 15:22:59 6T:origin300 postgres[2008]: [8-1] LOG:  next MultiXactId:
1; next MultiXactOffset: 0
Jun  5 15:22:59 6T:origin300 postgres[2008]: [9-1] LOG:  database system is
ready
* first stop
Jun  5 15:23:11 6T:origin300 postgres[2007]: [4-1] LOG:  received smart
shutdown request
Jun  5 15:23:11 6T:origin300 postgres[2009]: [4-1] LOG:  shutting down
Jun  5 15:23:11 6T:origin300 postgres[2009]: [5-1] LOG:  database system is
shut down
* 2d start
Jun  5 15:23:17 6T:origin300 postgres[2027]: [1-1] LOG:  could not create
IPv6 socket: Address family not supported by protocol family
Jun  5 15:23:17 6T:origin300 postgres[2027]: [2-1] LOG:  could not create
socket for statistics collector: Address family not supported by protocol
family
Jun  5 15:23:17 6T:origin300 postgres[2027]: [3-1] LOG:  trying another
address for the statistics collector
Jun  5 15:23:17 6T:origin300 postgres[2028]: [4-1] LOG:  database system was
shut down at 2007-06-05 15:23:11 GMT
Jun  5 15:23:17 6T:origin300 postgres[2028]: [5-1] LOG:  record with zero
length at 0/486718
* vvv I believe this line is not normal, as there was a clean shutdown
Jun  5 15:23:17 6T:origin300 postgres[2028]: [6-1] LOG:  invalid primary
checkpoint record
Jun  5 15:23:17 6T:origin300 postgres[2028]: [7-1] LOG:  using previous
checkpoint record at 0/4866C8
Jun  5 15:23:17 6T:origin300 postgres[2028]: [8-1] LOG:  redo record is at
0/4866C8; undo record is at 0/0; shutdown TRUE
Jun  5 15:23:17 6T:origin300 postgres[2028]: [9-1] LOG:  next transaction
ID: 0/593; next OID: 10820
Jun  5 15:23:17 6T:origin300 postgres[2028]: [10-1] LOG:  next MultiXactId:
1; next MultiXactOffset: 0
Jun  5 15:23:17 6T:origin300 postgres[2028]: [11-1] LOG:  database system
was not properly shut down; automatic recovery in progress
Jun  5 15:23:17 6T:origin300 postgres[2028]: [12-1] LOG:  record with zero
length at 0/486718
Jun  5 15:23:17 6T:origin300 postgres[2028]: [13-1] LOG:  redo is not
required
Jun  5 15:23:17 6T:origin300 postgres[2028]: [14-1] LOG:  database system is
ready
* 2d stop
Jun  5 15:23:22 6T:origin300 postgres[2027]: [4-1] LOG:  received smart
shutdown request
Jun  5 15:23:22 6T:origin300 postgres[2029]: [4-1] LOG:  shutting down
Jun  5 15:23:22 6T:origin300 postgres[2029]: [5-1] LOG:  database system is
shut down
* 3d start
Jun  5 15:23:25 6T:origin300 postgres[2047]: [1-1] LOG:  could not create
IPv6 socket: Address family not supported by protocol family
Jun  5 15:23:25 6T:origin300 postgres[2047]: [2-1] LOG:  could not create
socket for statistics collector: Address family not supported by protocol
family
Jun  5 15:23:25 6T:origin300 postgres[2047]: [3-1] LOG:  trying another
address for the statistics collector
Jun  5 15:23:25 6T:origin300 postgres[2048]: [4-1] LOG:  database system was
shut down at 2007-06-05 15:23:22 GMT
Jun  5 15:23:25 6T:origin300 postgres[2048]: [5-1] LOG:  record with zero
length at 0/486768
Jun  5 15:23:25 6T:origin300 postgres[2048]: [6-1] LOG:  invalid primary
checkpoint record
Jun  5 15:23:25 6T:origin300 postgres[2048]: [7-1] LOG:  record with zero
length at 0/486718
Jun  5 15:23:25 6T:origin300 postgres[2048]: [8-1] LOG:  invalid secondary
checkpoint record
* vvv this it gets worse
Jun  5 15:23:25 2T:origin300 postgres[2048]: [9-1] PANIC:  could not locate
a valid checkpoint record
Jun  5 15:23:26 6T:origin300 postgres[2047]: [4-1] LOG:  startup process
(PID 2048) was terminated by signal 6
Jun  5 15:23:26 6T:origin300 postgres[2047]: [5-1] LOG:  aborting startup
due to startup process failure

I had already noticed that problem on this platform with 8.0 or 8.1, but
didn't report it at the time.

I kept the core file and here is the backtrace :

origin300:/usr/local/pgsql/data# dbx /usr/local/pgsql/bin/postgres core
dbx version 7.3.4 (86441_Nov11 MR) Nov 11 2002 11:31:55
Core from signal SIGABRT: Abort (see abort(3c))
(dbx) where
>  0 _kill(0x800, 0x6, 0x86bf8, 0x0, 0x40000, 0x0, 0x68a1, 0x0)
["/xlv41/6.5.26m/work/irix/lib/libc/libc_n32_M4/signal/kill.s":15,
0xfa54258]
   1 _raise(0x800, 0x6, 0x86bf8, 0x0, 0x40000, 0x0, 0x68a1, 0x0)
["/xlv41/6.5.26m/work/irix/lib/libc/libc_n32_M4/signal/raise.c":27,
0xfad16dc]
   2 abort(0x800, 0x6, 0x86bf8, 0x0, 0x40000, 0x0, 0x68a1, 0x0)
["/xlv41/6.5.26m/work/irix/lib/libc/libc_n32_M4/gen/abort.c":52, 0xfa6ef60]
   3 errfinish(0x0, 0x0, 0x3, 0x5, 0x40000, 0x0, 0x68a1, 0x0)
["/opt/install/postgresql-8.2.4/src/backend/utils/error/elog.c":451,
0x102d9678]
   4 StartupXLOG(0x800, 0x6, 0x86bf8, 0x0, 0x40000, 0x0, 0x68a1, 0x0)
["/opt/install/postgresql-8.2.4/src/backend/access/transam/xlog.c":4776,
0x10088638]
   5 BootstrapMain(0x103b7524, 0x2, 0x86bf8, 0x0, 0x40000, 0x0, 0x68a1, 0x0)
["/opt/install/postgresql-8.2.4/src/backend/bootstrap/bootstrap.c":423,
0x100a3cd4]
   6 StartChildProcess(0x2, 0x6, 0x86bf8, 0x0, 0x40000, 0x0, 0x68a1, 0x0)
["/opt/install/postgresql-8.2.4/src/backend/postmaster/postmaster.c":3691,
0x101e6de8]
   7 PostmasterMain(0x800, 0xfa49bf0, 0x86bf8, 0x0, 0x40000, 0x0, 0x68a1,
0x0)
["/opt/install/postgresql-8.2.4/src/backend/postmaster/postmaster.c":961,
0x101e267c]
   8 main(0x3, 0x6, 0x86bf8, 0x0, 0x40000, 0x0, 0x68a1, 0x0)
["/opt/install/postgresql-8.2.4/src/backend/main/main.c":188, 0x1017e684]
   9 __start()
["/xlv55/kudzu-apr12/work/irix/lib/libc/libc_n32_M4/csu/crt1text.s":177,
0x1003f2a8]

If necessary, I can provide access to an irix machine if some pgsql
developpers need it to test/maintain the irix port.

Re: BUG #3362: xlog corruption just after initdb on irix

From
Tom Lane
Date:
"Herve Boulouis" <amon@sockar.homeip.net> writes:
> - trigger the bug

> /etc/init.d/pgsql start
> /etc/init.d/pgsql stop
> /etc/init.d/pgsql start
> /etc/init.d/pgsql stop
> /etc/init.d/pgsql start
> <signal 6 on postmaster startup, see logfile>

> from that point, postmaster won't start unless another initdb is done.

That's pretty dang weird.  It looks like it's failing to write the
shutdown checkpoint record (but is updating pg_control anyway).  But if
that's the explanation then it should have failed during initdb, because
each step of initdb goes through the start/stop process.

> I had already noticed that problem on this platform with 8.0 or 8.1, but
> didn't report it at the time.

You should have mentioned it earlier :-( ... whatever the problem is
is surely now long-forgotten.  Although I see we have a report of 8.1
working with IRIX 6.5, so apparently it works for some people.  Are you
sure your compiler is up-to-date?

> If necessary, I can provide access to an irix machine if some pgsql
> developpers need it to test/maintain the irix port.

I'm willing to take a look, but first please see if you can eliminate
the compiler-bug theory.  Check to see if there's a newer compiler
version available; check whether building with CFLAGS=-O0 makes the
problem go away.

            regards, tom lane

Re: BUG #3362: xlog corruption just after initdb on irix

From
"Simon Riggs"
Date:
On Tue, 2007-06-05 at 14:42 +0000, Herve Boulouis wrote:

> PostgreSQL version: 8.2.4
> Operating system:   IRIX 6.5.26

Thats not yet a supported system, but we should fix this.

> Jun  5 15:23:11 6T:origin300 postgres[2009]: [5-1] LOG:  database system is
> shut down
> * 2d start
> Jun  5 15:23:17 6T:origin300 postgres[2027]: [1-1] LOG:  could not create
> IPv6 socket: Address family not supported by protocol family
> Jun  5 15:23:17 6T:origin300 postgres[2027]: [2-1] LOG:  could not create
> socket for statistics collector: Address family not supported by protocol
> family
> Jun  5 15:23:17 6T:origin300 postgres[2027]: [3-1] LOG:  trying another
> address for the statistics collector
> Jun  5 15:23:17 6T:origin300 postgres[2028]: [4-1] LOG:  database system was
> shut down at 2007-06-05 15:23:11 GMT
> Jun  5 15:23:17 6T:origin300 postgres[2028]: [5-1] LOG:  record with zero
> length at 0/486718
> * vvv I believe this line is not normal, as there was a clean shutdown
> Jun  5 15:23:17 6T:origin300 postgres[2028]: [6-1] LOG:  invalid primary
> checkpoint record
> Jun  5 15:23:17 6T:origin300 postgres[2028]: [7-1] LOG:  using previous
> checkpoint record at 0/4866C8
> Jun  5 15:23:17 6T:origin300 postgres[2028]: [8-1] LOG:  redo record is at
> 0/4866C8; undo record is at 0/0; shutdown TRUE
> Jun  5 15:23:17 6T:origin300 postgres[2028]: [9-1] LOG:  next transaction
> ID: 0/593; next OID: 10820
> Jun  5 15:23:17 6T:origin300 postgres[2028]: [10-1] LOG:  next MultiXactId:
> 1; next MultiXactOffset: 0
> Jun  5 15:23:17 6T:origin300 postgres[2028]: [11-1] LOG:  database system
> was not properly shut down; automatic recovery in progress
> Jun  5 15:23:17 6T:origin300 postgres[2028]: [12-1] LOG:  record with zero
> length at 0/486718
> Jun  5 15:23:17 6T:origin300 postgres[2028]: [13-1] LOG:  redo is not
> required
> Jun  5 15:23:17 6T:origin300 postgres[2028]: [14-1] LOG:  database system is
> ready

Well, I'd question what's going on with the hardware/filesystem or the
settings you've chosen for fsync and wal_fsync.

If you shut it down cleanly, yet its flaky when it comes back up that
doesn't sound too much like a database problem to me...

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com

Re: BUG #3362: xlog corruption just after initdb on irix

From
Herve Boulouis
Date:
Le 05/06/2007  17:33, Simon Riggs a écrit:
> On Tue, 2007-06-05 at 14:42 +0000, Herve Boulouis wrote:
>
> Well, I'd question what's going on with the hardware/filesystem or the
> settings you've chosen for fsync and wal_fsync.
>
> If you shut it down cleanly, yet its flaky when it comes back up that
> doesn't sound too much like a database problem to me...

I should have precised that I have "fsync = on" in my postgresql.conf.
Pretty much everything else is default value.

Regards,

--
Herve Boulouis

Re: BUG #3362: xlog corruption just after initdb on irix

From
Herve Boulouis
Date:
Le 05/06/2007  12:06, Tom Lane a écrit:
>
> > I had already noticed that problem on this platform with 8.0 or 8.1, but
> > didn't report it at the time.
>
> You should have mentioned it earlier :-( ... whatever the problem is
> is surely now long-forgotten.  Although I see we have a report of 8.1
> working with IRIX 6.5, so apparently it works for some people.  Are you
> sure your compiler is up-to-date?

Yep, MIPSPro 7.4.4m is the latest available and probably the last one :(

> > If necessary, I can provide access to an irix machine if some pgsql
> > developpers need it to test/maintain the irix port.
>
> I'm willing to take a look, but first please see if you can eliminate
> the compiler-bug theory.  Check to see if there's a newer compiler
> version available; check whether building with CFLAGS=-O0 makes the
> problem go away.

I'll try without optimizations.

Regards,

--
Herve Boulouis

Re: BUG #3362: xlog corruption just after initdb on irix

From
Heikki Linnakangas
Date:
Herve Boulouis wrote:
> Le 05/06/2007  17:33, Simon Riggs a écrit:
>> On Tue, 2007-06-05 at 14:42 +0000, Herve Boulouis wrote:
>>
>> Well, I'd question what's going on with the hardware/filesystem or the
>> settings you've chosen for fsync and wal_fsync.
>>
>> If you shut it down cleanly, yet its flaky when it comes back up that
>> doesn't sound too much like a database problem to me...
>
> I should have precised that I have "fsync = on" in my postgresql.conf.
> Pretty much everything else is default value.

Do you mean "fsync=off"? "on" is the default.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: BUG #3362: xlog corruption just after initdb on irix

From
Herve Boulouis
Date:
Le 06/06/2007  08:24, Heikki Linnakangas a écrit:
> Herve Boulouis wrote:
> >Le 05/06/2007  17:33, Simon Riggs a ?crit:
> >>On Tue, 2007-06-05 at 14:42 +0000, Herve Boulouis wrote:
> >>
> >>Well, I'd question what's going on with the hardware/filesystem or the
> >>settings you've chosen for fsync and wal_fsync.
> >>
> >>If you shut it down cleanly, yet its flaky when it comes back up that
> >>doesn't sound too much like a database problem to me...
> >
> >I should have precised that I have "fsync = on" in my postgresql.conf.
> >Pretty much everything else is default value.
>
> Do you mean "fsync=off"? "on" is the default.

No, I mean "fsync=on", everything else is commented out so I assume default
values.

--
Herve Boulouis

Re: BUG #3362: xlog corruption just after initdb on irix

From
Herve Boulouis
Date:
Le 05/06/2007  12:06, Tom Lane a écrit:
>
> I'm willing to take a look, but first please see if you can eliminate
> the compiler-bug theory.  Check to see if there's a newer compiler
> version available; check whether building with CFLAGS=-O0 makes the
> problem go away.

quick question : do you think that the  --enable-integer-datetimes configure
flag could have an impact on this problem ?

--
Herve Boulouis

Re: BUG #3362: xlog corruption just after initdb on irix

From
Tom Lane
Date:
Herve Boulouis <amon@sockar.homeip.net> writes:
> quick question : do you think that the  --enable-integer-datetimes configure
> flag could have an impact on this problem ?

I doubt it.

            regards, tom lane

Re: BUG #3362: xlog corruption just after initdb on irix

From
Tom Lane
Date:
Herve Boulouis <amon@sockar.homeip.net> writes:
> The mipspro compiler is a proprietary product and I cannot get the
> necessary updates so for now I am unable to verify this theory. Anyway,
> maybe you should a add a note in the pg irix faq to use the 7.4.4m version
> of mipspro when building the sources ?

Maybe, but I'd like confirmation of the theory first.  Please let me
know how it goes when you get hold of 7.4.4m ...

            regards, tom lane

Re: BUG #3362: xlog corruption just after initdb on irix

From
Herve Boulouis
Date:
Le 05/06/2007  12:06, Tom Lane a écrit:
>
> I'm willing to take a look, but first please see if you can eliminate
> the compiler-bug theory.  Check to see if there's a newer compiler
> version available; check whether building with CFLAGS=-O0 makes the
> problem go away.

I tried -O0 with same results but I've found something more interesting :

It seems that my compiler version is not 7.4.4m but 7.4.1m, I got lost
in the sgi compiler upgrades. As the folks at nekochan.net provide a binary
package of postgresql for irix that works correctly, I tried compiling
with their build flags but the problem is still there. I asked the
package maintainer which compiler version they use and it's 7.4.4m. So
our problem could very well be a compiler bug.

The mipspro compiler is a proprietary product and I cannot get the
necessary updates so for now I am unable to verify this theory. Anyway,
maybe you should a add a note in the pg irix faq to use the 7.4.4m version
of mipspro when building the sources ?

--
Herve Boulouis

Re: BUG #3362: xlog corruption just after initdb on irix

From
Herve Boulouis
Date:
Le 06/06/2007  18:04, Tom Lane a écrit:

Hi,

> Herve Boulouis <amon@sockar.homeip.net> writes:
> > The mipspro compiler is a proprietary product and I cannot get the
> > necessary updates so for now I am unable to verify this theory. Anyway,
> > maybe you should a add a note in the pg irix faq to use the 7.4.4m version
> > of mipspro when building the sources ?
>
> Maybe, but I'd like confirmation of the theory first.  Please let me
> know how it goes when you get hold of 7.4.4m ...

I finally got the 7.4.4m update and repeated the entire test (compile,
install, initdb, start/stop multiple times) and the problem has gone away.

My test server hasn't changed since I reported the bug, so I think you
can add a note in the pg documentation regarding the MIPSPro version to use.
(7.4.4m ok, 7.4.1m broken, dunno about the intermediate versions)

--
Herve Boulouis

Re: BUG #3362: xlog corruption just after initdb on irix

From
Tom Lane
Date:
Herve Boulouis <amon@sockar.homeip.net> writes:
> I finally got the 7.4.4m update and repeated the entire test (compile,
> install, initdb, start/stop multiple times) and the problem has gone away.

> My test server hasn't changed since I reported the bug, so I think you
> can add a note in the pg documentation regarding the MIPSPro version to use.
> (7.4.4m ok, 7.4.1m broken, dunno about the intermediate versions)

Done, thanks for the update!

            regards, tom lane