Thread: BUG #3362: xlog corruption just after initdb on irix
The following bug has been logged online: Bug reference: 3362 Logged by: Herve Boulouis Email address: amon@sockar.homeip.net PostgreSQL version: 8.2.4 Operating system: IRIX 6.5.26 Description: xlog corruption just after initdb on irix Details: what I did: - compile postgres 8.2.4 (MIPSPro 7.4.4m compiler) export CC=c99 export CXX=CC ./configure --enable-integer-datetimes --with-perl --with-python - compile and install, create pgsql user, ... - initialize db su - pgsql export PGLIB=/usr/local/pgsql/lib export PGDATA=$HOME/data initdb - trigger the bug /etc/init.d/pgsql start /etc/init.d/pgsql stop /etc/init.d/pgsql start /etc/init.d/pgsql stop /etc/init.d/pgsql start <signal 6 on postmaster startup, see logfile> from that point, postmaster won't start unless another initdb is done. - command used by the script to start/stop su - pgsql -c "exec /usr/local/pgsql/bin/pg_ctl start -w -D '/usr/local/pgsql/data'" su - pgsql -c "/usr/local/pgsql/bin/pg_ctl stop -D '/usr/local/pgsql/data' -s -m smart" - logfile of the 5 start/stop operations : * first start Jun 5 15:22:59 6T:origin300 postgres[2007]: [1-1] LOG: could not create IPv6 socket: Address family not supported by protocol family Jun 5 15:22:59 6T:origin300 postgres[2007]: [2-1] LOG: could not create socket for statistics collector: Address family not supported by protocol family Jun 5 15:22:59 6T:origin300 postgres[2007]: [3-1] LOG: trying another address for the statistics collector Jun 5 15:22:59 6T:origin300 postgres[2008]: [4-1] LOG: database system was shut down at 2007-06-05 15:21:59 GMT Jun 5 15:22:59 6T:origin300 postgres[2008]: [5-1] LOG: checkpoint record is at 0/4866C8 Jun 5 15:22:59 6T:origin300 postgres[2008]: [6-1] LOG: redo record is at 0/4866C8; undo record is at 0/0; shutdown TRUE Jun 5 15:22:59 6T:origin300 postgres[2008]: [7-1] LOG: next transaction ID: 0/593; next OID: 10820 Jun 5 15:22:59 6T:origin300 postgres[2008]: [8-1] LOG: next MultiXactId: 1; next MultiXactOffset: 0 Jun 5 15:22:59 6T:origin300 postgres[2008]: [9-1] LOG: database system is ready * first stop Jun 5 15:23:11 6T:origin300 postgres[2007]: [4-1] LOG: received smart shutdown request Jun 5 15:23:11 6T:origin300 postgres[2009]: [4-1] LOG: shutting down Jun 5 15:23:11 6T:origin300 postgres[2009]: [5-1] LOG: database system is shut down * 2d start Jun 5 15:23:17 6T:origin300 postgres[2027]: [1-1] LOG: could not create IPv6 socket: Address family not supported by protocol family Jun 5 15:23:17 6T:origin300 postgres[2027]: [2-1] LOG: could not create socket for statistics collector: Address family not supported by protocol family Jun 5 15:23:17 6T:origin300 postgres[2027]: [3-1] LOG: trying another address for the statistics collector Jun 5 15:23:17 6T:origin300 postgres[2028]: [4-1] LOG: database system was shut down at 2007-06-05 15:23:11 GMT Jun 5 15:23:17 6T:origin300 postgres[2028]: [5-1] LOG: record with zero length at 0/486718 * vvv I believe this line is not normal, as there was a clean shutdown Jun 5 15:23:17 6T:origin300 postgres[2028]: [6-1] LOG: invalid primary checkpoint record Jun 5 15:23:17 6T:origin300 postgres[2028]: [7-1] LOG: using previous checkpoint record at 0/4866C8 Jun 5 15:23:17 6T:origin300 postgres[2028]: [8-1] LOG: redo record is at 0/4866C8; undo record is at 0/0; shutdown TRUE Jun 5 15:23:17 6T:origin300 postgres[2028]: [9-1] LOG: next transaction ID: 0/593; next OID: 10820 Jun 5 15:23:17 6T:origin300 postgres[2028]: [10-1] LOG: next MultiXactId: 1; next MultiXactOffset: 0 Jun 5 15:23:17 6T:origin300 postgres[2028]: [11-1] LOG: database system was not properly shut down; automatic recovery in progress Jun 5 15:23:17 6T:origin300 postgres[2028]: [12-1] LOG: record with zero length at 0/486718 Jun 5 15:23:17 6T:origin300 postgres[2028]: [13-1] LOG: redo is not required Jun 5 15:23:17 6T:origin300 postgres[2028]: [14-1] LOG: database system is ready * 2d stop Jun 5 15:23:22 6T:origin300 postgres[2027]: [4-1] LOG: received smart shutdown request Jun 5 15:23:22 6T:origin300 postgres[2029]: [4-1] LOG: shutting down Jun 5 15:23:22 6T:origin300 postgres[2029]: [5-1] LOG: database system is shut down * 3d start Jun 5 15:23:25 6T:origin300 postgres[2047]: [1-1] LOG: could not create IPv6 socket: Address family not supported by protocol family Jun 5 15:23:25 6T:origin300 postgres[2047]: [2-1] LOG: could not create socket for statistics collector: Address family not supported by protocol family Jun 5 15:23:25 6T:origin300 postgres[2047]: [3-1] LOG: trying another address for the statistics collector Jun 5 15:23:25 6T:origin300 postgres[2048]: [4-1] LOG: database system was shut down at 2007-06-05 15:23:22 GMT Jun 5 15:23:25 6T:origin300 postgres[2048]: [5-1] LOG: record with zero length at 0/486768 Jun 5 15:23:25 6T:origin300 postgres[2048]: [6-1] LOG: invalid primary checkpoint record Jun 5 15:23:25 6T:origin300 postgres[2048]: [7-1] LOG: record with zero length at 0/486718 Jun 5 15:23:25 6T:origin300 postgres[2048]: [8-1] LOG: invalid secondary checkpoint record * vvv this it gets worse Jun 5 15:23:25 2T:origin300 postgres[2048]: [9-1] PANIC: could not locate a valid checkpoint record Jun 5 15:23:26 6T:origin300 postgres[2047]: [4-1] LOG: startup process (PID 2048) was terminated by signal 6 Jun 5 15:23:26 6T:origin300 postgres[2047]: [5-1] LOG: aborting startup due to startup process failure I had already noticed that problem on this platform with 8.0 or 8.1, but didn't report it at the time. I kept the core file and here is the backtrace : origin300:/usr/local/pgsql/data# dbx /usr/local/pgsql/bin/postgres core dbx version 7.3.4 (86441_Nov11 MR) Nov 11 2002 11:31:55 Core from signal SIGABRT: Abort (see abort(3c)) (dbx) where > 0 _kill(0x800, 0x6, 0x86bf8, 0x0, 0x40000, 0x0, 0x68a1, 0x0) ["/xlv41/6.5.26m/work/irix/lib/libc/libc_n32_M4/signal/kill.s":15, 0xfa54258] 1 _raise(0x800, 0x6, 0x86bf8, 0x0, 0x40000, 0x0, 0x68a1, 0x0) ["/xlv41/6.5.26m/work/irix/lib/libc/libc_n32_M4/signal/raise.c":27, 0xfad16dc] 2 abort(0x800, 0x6, 0x86bf8, 0x0, 0x40000, 0x0, 0x68a1, 0x0) ["/xlv41/6.5.26m/work/irix/lib/libc/libc_n32_M4/gen/abort.c":52, 0xfa6ef60] 3 errfinish(0x0, 0x0, 0x3, 0x5, 0x40000, 0x0, 0x68a1, 0x0) ["/opt/install/postgresql-8.2.4/src/backend/utils/error/elog.c":451, 0x102d9678] 4 StartupXLOG(0x800, 0x6, 0x86bf8, 0x0, 0x40000, 0x0, 0x68a1, 0x0) ["/opt/install/postgresql-8.2.4/src/backend/access/transam/xlog.c":4776, 0x10088638] 5 BootstrapMain(0x103b7524, 0x2, 0x86bf8, 0x0, 0x40000, 0x0, 0x68a1, 0x0) ["/opt/install/postgresql-8.2.4/src/backend/bootstrap/bootstrap.c":423, 0x100a3cd4] 6 StartChildProcess(0x2, 0x6, 0x86bf8, 0x0, 0x40000, 0x0, 0x68a1, 0x0) ["/opt/install/postgresql-8.2.4/src/backend/postmaster/postmaster.c":3691, 0x101e6de8] 7 PostmasterMain(0x800, 0xfa49bf0, 0x86bf8, 0x0, 0x40000, 0x0, 0x68a1, 0x0) ["/opt/install/postgresql-8.2.4/src/backend/postmaster/postmaster.c":961, 0x101e267c] 8 main(0x3, 0x6, 0x86bf8, 0x0, 0x40000, 0x0, 0x68a1, 0x0) ["/opt/install/postgresql-8.2.4/src/backend/main/main.c":188, 0x1017e684] 9 __start() ["/xlv55/kudzu-apr12/work/irix/lib/libc/libc_n32_M4/csu/crt1text.s":177, 0x1003f2a8] If necessary, I can provide access to an irix machine if some pgsql developpers need it to test/maintain the irix port.
"Herve Boulouis" <amon@sockar.homeip.net> writes: > - trigger the bug > /etc/init.d/pgsql start > /etc/init.d/pgsql stop > /etc/init.d/pgsql start > /etc/init.d/pgsql stop > /etc/init.d/pgsql start > <signal 6 on postmaster startup, see logfile> > from that point, postmaster won't start unless another initdb is done. That's pretty dang weird. It looks like it's failing to write the shutdown checkpoint record (but is updating pg_control anyway). But if that's the explanation then it should have failed during initdb, because each step of initdb goes through the start/stop process. > I had already noticed that problem on this platform with 8.0 or 8.1, but > didn't report it at the time. You should have mentioned it earlier :-( ... whatever the problem is is surely now long-forgotten. Although I see we have a report of 8.1 working with IRIX 6.5, so apparently it works for some people. Are you sure your compiler is up-to-date? > If necessary, I can provide access to an irix machine if some pgsql > developpers need it to test/maintain the irix port. I'm willing to take a look, but first please see if you can eliminate the compiler-bug theory. Check to see if there's a newer compiler version available; check whether building with CFLAGS=-O0 makes the problem go away. regards, tom lane
On Tue, 2007-06-05 at 14:42 +0000, Herve Boulouis wrote: > PostgreSQL version: 8.2.4 > Operating system: IRIX 6.5.26 Thats not yet a supported system, but we should fix this. > Jun 5 15:23:11 6T:origin300 postgres[2009]: [5-1] LOG: database system is > shut down > * 2d start > Jun 5 15:23:17 6T:origin300 postgres[2027]: [1-1] LOG: could not create > IPv6 socket: Address family not supported by protocol family > Jun 5 15:23:17 6T:origin300 postgres[2027]: [2-1] LOG: could not create > socket for statistics collector: Address family not supported by protocol > family > Jun 5 15:23:17 6T:origin300 postgres[2027]: [3-1] LOG: trying another > address for the statistics collector > Jun 5 15:23:17 6T:origin300 postgres[2028]: [4-1] LOG: database system was > shut down at 2007-06-05 15:23:11 GMT > Jun 5 15:23:17 6T:origin300 postgres[2028]: [5-1] LOG: record with zero > length at 0/486718 > * vvv I believe this line is not normal, as there was a clean shutdown > Jun 5 15:23:17 6T:origin300 postgres[2028]: [6-1] LOG: invalid primary > checkpoint record > Jun 5 15:23:17 6T:origin300 postgres[2028]: [7-1] LOG: using previous > checkpoint record at 0/4866C8 > Jun 5 15:23:17 6T:origin300 postgres[2028]: [8-1] LOG: redo record is at > 0/4866C8; undo record is at 0/0; shutdown TRUE > Jun 5 15:23:17 6T:origin300 postgres[2028]: [9-1] LOG: next transaction > ID: 0/593; next OID: 10820 > Jun 5 15:23:17 6T:origin300 postgres[2028]: [10-1] LOG: next MultiXactId: > 1; next MultiXactOffset: 0 > Jun 5 15:23:17 6T:origin300 postgres[2028]: [11-1] LOG: database system > was not properly shut down; automatic recovery in progress > Jun 5 15:23:17 6T:origin300 postgres[2028]: [12-1] LOG: record with zero > length at 0/486718 > Jun 5 15:23:17 6T:origin300 postgres[2028]: [13-1] LOG: redo is not > required > Jun 5 15:23:17 6T:origin300 postgres[2028]: [14-1] LOG: database system is > ready Well, I'd question what's going on with the hardware/filesystem or the settings you've chosen for fsync and wal_fsync. If you shut it down cleanly, yet its flaky when it comes back up that doesn't sound too much like a database problem to me... -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
Le 05/06/2007 17:33, Simon Riggs a écrit: > On Tue, 2007-06-05 at 14:42 +0000, Herve Boulouis wrote: > > Well, I'd question what's going on with the hardware/filesystem or the > settings you've chosen for fsync and wal_fsync. > > If you shut it down cleanly, yet its flaky when it comes back up that > doesn't sound too much like a database problem to me... I should have precised that I have "fsync = on" in my postgresql.conf. Pretty much everything else is default value. Regards, -- Herve Boulouis
Le 05/06/2007 12:06, Tom Lane a écrit: > > > I had already noticed that problem on this platform with 8.0 or 8.1, but > > didn't report it at the time. > > You should have mentioned it earlier :-( ... whatever the problem is > is surely now long-forgotten. Although I see we have a report of 8.1 > working with IRIX 6.5, so apparently it works for some people. Are you > sure your compiler is up-to-date? Yep, MIPSPro 7.4.4m is the latest available and probably the last one :( > > If necessary, I can provide access to an irix machine if some pgsql > > developpers need it to test/maintain the irix port. > > I'm willing to take a look, but first please see if you can eliminate > the compiler-bug theory. Check to see if there's a newer compiler > version available; check whether building with CFLAGS=-O0 makes the > problem go away. I'll try without optimizations. Regards, -- Herve Boulouis
Herve Boulouis wrote: > Le 05/06/2007 17:33, Simon Riggs a écrit: >> On Tue, 2007-06-05 at 14:42 +0000, Herve Boulouis wrote: >> >> Well, I'd question what's going on with the hardware/filesystem or the >> settings you've chosen for fsync and wal_fsync. >> >> If you shut it down cleanly, yet its flaky when it comes back up that >> doesn't sound too much like a database problem to me... > > I should have precised that I have "fsync = on" in my postgresql.conf. > Pretty much everything else is default value. Do you mean "fsync=off"? "on" is the default. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Le 06/06/2007 08:24, Heikki Linnakangas a écrit: > Herve Boulouis wrote: > >Le 05/06/2007 17:33, Simon Riggs a ?crit: > >>On Tue, 2007-06-05 at 14:42 +0000, Herve Boulouis wrote: > >> > >>Well, I'd question what's going on with the hardware/filesystem or the > >>settings you've chosen for fsync and wal_fsync. > >> > >>If you shut it down cleanly, yet its flaky when it comes back up that > >>doesn't sound too much like a database problem to me... > > > >I should have precised that I have "fsync = on" in my postgresql.conf. > >Pretty much everything else is default value. > > Do you mean "fsync=off"? "on" is the default. No, I mean "fsync=on", everything else is commented out so I assume default values. -- Herve Boulouis
Le 05/06/2007 12:06, Tom Lane a écrit: > > I'm willing to take a look, but first please see if you can eliminate > the compiler-bug theory. Check to see if there's a newer compiler > version available; check whether building with CFLAGS=-O0 makes the > problem go away. quick question : do you think that the --enable-integer-datetimes configure flag could have an impact on this problem ? -- Herve Boulouis
Herve Boulouis <amon@sockar.homeip.net> writes: > quick question : do you think that the --enable-integer-datetimes configure > flag could have an impact on this problem ? I doubt it. regards, tom lane
Herve Boulouis <amon@sockar.homeip.net> writes: > The mipspro compiler is a proprietary product and I cannot get the > necessary updates so for now I am unable to verify this theory. Anyway, > maybe you should a add a note in the pg irix faq to use the 7.4.4m version > of mipspro when building the sources ? Maybe, but I'd like confirmation of the theory first. Please let me know how it goes when you get hold of 7.4.4m ... regards, tom lane
Le 05/06/2007 12:06, Tom Lane a écrit: > > I'm willing to take a look, but first please see if you can eliminate > the compiler-bug theory. Check to see if there's a newer compiler > version available; check whether building with CFLAGS=-O0 makes the > problem go away. I tried -O0 with same results but I've found something more interesting : It seems that my compiler version is not 7.4.4m but 7.4.1m, I got lost in the sgi compiler upgrades. As the folks at nekochan.net provide a binary package of postgresql for irix that works correctly, I tried compiling with their build flags but the problem is still there. I asked the package maintainer which compiler version they use and it's 7.4.4m. So our problem could very well be a compiler bug. The mipspro compiler is a proprietary product and I cannot get the necessary updates so for now I am unable to verify this theory. Anyway, maybe you should a add a note in the pg irix faq to use the 7.4.4m version of mipspro when building the sources ? -- Herve Boulouis
Le 06/06/2007 18:04, Tom Lane a écrit: Hi, > Herve Boulouis <amon@sockar.homeip.net> writes: > > The mipspro compiler is a proprietary product and I cannot get the > > necessary updates so for now I am unable to verify this theory. Anyway, > > maybe you should a add a note in the pg irix faq to use the 7.4.4m version > > of mipspro when building the sources ? > > Maybe, but I'd like confirmation of the theory first. Please let me > know how it goes when you get hold of 7.4.4m ... I finally got the 7.4.4m update and repeated the entire test (compile, install, initdb, start/stop multiple times) and the problem has gone away. My test server hasn't changed since I reported the bug, so I think you can add a note in the pg documentation regarding the MIPSPro version to use. (7.4.4m ok, 7.4.1m broken, dunno about the intermediate versions) -- Herve Boulouis
Herve Boulouis <amon@sockar.homeip.net> writes: > I finally got the 7.4.4m update and repeated the entire test (compile, > install, initdb, start/stop multiple times) and the problem has gone away. > My test server hasn't changed since I reported the bug, so I think you > can add a note in the pg documentation regarding the MIPSPro version to use. > (7.4.4m ok, 7.4.1m broken, dunno about the intermediate versions) Done, thanks for the update! regards, tom lane