Thread: make check hangs in alpha5
* Environment:
CentOS 5.2 on VirtualBox 3.0.6 on Windows XP SP3

$ uname -a
Linux localhost.localdomain 2.6.18-92.1.22.el5 #1 SMP Tue Dec 16 12:03:43 EST 2008 i686 i686 i386 GNU/Linux
$ gcc -v
Target: i386-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-libgcj-multifile --enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk --disable-dssi --enable-plugin --with-java-home=/usr/lib/jvm/java-1.4.2-gcj-1.4.2.0/jre --with-cpu=generic --host=i386-redhat-linux
Thread model: posix
gcc version 4.1.2 20071124 (Red Hat 4.1.2-42)

* Fact:
In Alpha4, Alpha5 and some snapshots from around the middle of January, "make check" hangs and then fails with this error:

./pg_regress --inputdir=. --dlpath=. --multibyte=SQL_ASCII --temp-install=./tmp_check --top-builddir=../../.. --schedule=./parallel_schedule
============== creating temporary installation ==============
============== initializing database system ==============
============== starting postmaster ==============
pg_regress: postmaster did not respond within 60 seconds
Examine /home/forcia/work/postgresql/src/test/regress/log/postmaster.log for the reason
make[2]: *** [check] Error 2
make[2]: Leaving directory `/home/forcia/work/postgresql/src/test/regress'
make[1]: *** [check] Error 2
make[1]: Leaving directory `/home/forcia/work/postgresql/src/test'
make: *** [check] Error 2

$ cat src/test/regress/log/postmaster.log
LOG:  database system was shut down at 2010-04-04 16:35:37 JST
LOG:  autovacuum launcher started
LOG:  database system is ready to accept connections
== end of log ==

Attached is another log with debug5.

I run "make maintainer-clean" every time before "make check".
$ ./configure CFLAGS= --enable-debug --enable-cassert

It seems postgres starts up successfully, together with its auxiliary processes, but accepts connections on neither the UNIX domain socket nor the TCP socket. When I run postgres from the console with the same options that pg_regress uses when it forks it, it accepts connections. This occurs only when pg_regress spawns the postgres process.

* My investigation so far:
My bisect-ish (manual, actually :) research found that the failure begins with the commit:
<Introduce two new libpq connection functions, PQconnectdbParams and>
http://git.postgresql.org/gitweb?p=postgresql.git;a=commit;h=cc6c533b43b5244cf4af58709340bb7b9af26cce
But this commit is OK:
<Fix bug in walsender's xlogid boundary handling, reported by Erik Rijkers.>
http://git.postgresql.org/gitweb?p=postgresql.git;a=commit;h=c46f3447f1efbc9fbc30ce5d5b9c65bd75e10b89

The problem isn't in libpq, since it is that the server doesn't listen on startup as above. No /tmp/.s.PGSQL.5432, nor LISTENING entry in netstat. And not only the psql of the target version but also 8.4's psql cannot connect to the database.

I cannot figure out at all what is wrong. Have any idea?

Regards,
-- 
Hitoshi Harada
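Checking for the socket file only shows that the file exists, not that anything is listening on it. A more direct test, bypassing libpq and psql entirely, is a raw `connect()` on the socket path. The following is a minimal sketch; `probe_unix_socket` is a hypothetical helper, not part of PostgreSQL or libpq, and the path passed to it must match the port the postmaster was actually started with.

```c
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>

/*
 * Hypothetical diagnostic helper (not part of PostgreSQL or libpq).
 * Returns 0 if some process is listening on the Unix-domain socket at
 * 'path', and -1 otherwise (no such file, nobody listening, path too
 * long, ...). No data is exchanged; the connection is closed at once.
 */
static int
probe_unix_socket(const char *path)
{
    struct sockaddr_un addr;
    int         fd;
    int         rc;

    /* sun_path is a fixed-size array; refuse paths that do not fit */
    if (strlen(path) >= sizeof(addr.sun_path))
        return -1;

    fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strcpy(addr.sun_path, path);

    rc = connect(fd, (struct sockaddr *) &addr, sizeof(addr));
    close(fd);
    return (rc == 0) ? 0 : -1;
}
```

For example, `probe_unix_socket("/tmp/.s.PGSQL.5432")` returns 0 only while a server is listening on that socket; a stale socket file left behind by a dead process gives -1.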
Hitoshi Harada <umi.tanuki@gmail.com> writes: > I cannot figure out at all what is wrong. Have any idea? Since nobody else is reporting this, it seems like it must be either something messed up about your system, or something wrong with your copy of the PG sources. In the latter connection I confess to having little confidence in the git mirror setup. Have you tried diffing your source tree against a CVS checkout or nightly snapshot tarball? A different line of attack is to insert some debugging logging into the postmaster's port-opening code to see if you can find why it's apparently not doing anything. regards, tom lane
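The kind of step-by-step logging Tom suggests can be tried out first in a standalone program. This is a sketch only, not the postmaster's actual port-opening code: `open_logged_listener` is a hypothetical helper that runs the usual `socket()`/`bind()`/`listen()` sequence on the loopback address and reports each step, plus the port actually bound, to stderr, so a step that fails or binds somewhere unexpected shows up in the log.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/*
 * Hypothetical helper, not the postmaster's actual code: open a TCP
 * listening socket on 127.0.0.1:port, logging every step to stderr.
 * Returns the listening fd, or -1 on failure.
 */
static int
open_logged_listener(unsigned short port)
{
    struct sockaddr_in addr;
    socklen_t   len = sizeof(addr);
    int         one = 1;
    int         fd;

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
    {
        fprintf(stderr, "socket() failed: %s\n", strerror(errno));
        return -1;
    }
    fprintf(stderr, "socket() returned fd %d\n", fd);

    if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one)) < 0)
        fprintf(stderr, "setsockopt(SO_REUSEADDR) failed: %s\n", strerror(errno));

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *) &addr, sizeof(addr)) < 0)
    {
        fprintf(stderr, "bind(127.0.0.1:%u) failed: %s\n",
                (unsigned) port, strerror(errno));
        close(fd);
        return -1;
    }
    if (listen(fd, 64) < 0)
    {
        fprintf(stderr, "listen() failed: %s\n", strerror(errno));
        close(fd);
        return -1;
    }

    /* Log the port actually bound; this differs from 'port' when port is 0. */
    if (getsockname(fd, (struct sockaddr *) &addr, &len) == 0)
        fprintf(stderr, "listening on 127.0.0.1:%u\n",
                (unsigned) ntohs(addr.sin_port));
    return fd;
}
```

Passing port 0 lets the kernel choose a free port, which is handy for testing; the `getsockname()` line then reveals which port was chosen.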
Hitoshi Harada wrote: > The problem isn't in libpq, since it is that the server doesn't listen > on startup as above. No /tmp/.s.PGSQL.5432, nor LISTENING entry in > netstat. I find this somewhat implausible. The postmaster has this code that makes it die if it can't open a listening socket:

    if (ListenSocket[0] == PGINVALID_SOCKET)
        ereport(FATAL,
                (errmsg("no socket created for listening")));

Perhaps you could strace the execution. Of course, pg_regress doesn't usually run on port 5432, IIRC, so maybe you're looking for the wrong thing. Another question worth asking is whether either SELinux or the firewall settings on your CentOS box are interfering with connectivity. cheers andrew
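Andrew's point about the port can be checked mechanically. The Unix-domain socket file is named after the port (`.s.PGSQL.<port>` in the socket directory), so looking for `.s.PGSQL.5432` only makes sense if the temporary postmaster really uses 5432. The sketch below uses hypothetical helper names: one builds the socket file name for a given port, the other probes the same port over TCP on 127.0.0.1, which is only meaningful if TCP listening is enabled for that instance. The port should be the one the temporary postmaster was actually started with, not an assumed default.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical helper: build "<dir>/.s.PGSQL.<port>" into buf. */
static void
socket_path_for_port(char *buf, size_t len, const char *dir, int port)
{
    snprintf(buf, len, "%s/.s.PGSQL.%d", dir, port);
}

/*
 * Hypothetical helper: return 0 if something is listening on
 * 127.0.0.1:port, -1 otherwise (refused, unreachable, ...).
 */
static int
probe_tcp_loopback(unsigned short port)
{
    struct sockaddr_in addr;
    int         rc;
    int         fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);

    rc = connect(fd, (struct sockaddr *) &addr, sizeof(addr));
    close(fd);
    return (rc == 0) ? 0 : -1;
}
```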
On 4/4/10 9:19 AM, Tom Lane wrote: > Hitoshi Harada <umi.tanuki@gmail.com> writes: >> I cannot figure out at all what is wrong. Have any idea? We ran Make Check on 4 linux laptops and 3 macs yesterday, and did not see a hang. Could this be an issue with VirtualBox? Have you used this VM for testing before? -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
2010/4/5 Tom Lane <tgl@sss.pgh.pa.us>: > Hitoshi Harada <umi.tanuki@gmail.com> writes: >> I cannot figure out at all what is wrong. Have any idea? > > Since nobody else is reporting this, it seems like it must be either > something messed up about your system, or something wrong with your > copy of the PG sources. In the latter connection I confess to having > little confidence in the git mirror setup. Have you tried diffing > your source tree against a CVS checkout or nightly snapshot tarball? I've tried clean tarballs of alpha4 and alpha5, so messed-up sources seem unlikely. I'll try CVS HEAD. I've checked various real (not VM) Linux boxes and didn't hit the same situation. I know it's some kind of environment issue. > > A different line of attack is to insert some debugging logging into > the postmaster's port-opening code to see if you can find why it's > apparently not doing anything. I'll try it later. But stepping through with gdb looks clean: bind() and the other initializations look fine, and the postmaster is sitting in ServerLoop(). Regards, -- Hitoshi Harada
2010/4/5 Josh Berkus <josh@agliodbs.com>: > On 4/4/10 9:19 AM, Tom Lane wrote: >> Hitoshi Harada <umi.tanuki@gmail.com> writes: >>> I cannot figure out at all what is wrong. Have any idea? > > We ran Make Check on 4 linux laptops and 3 macs yesterday, and did not > see a hang. > > Could this be an issue with VirtualBox? Have you used this VM for > testing before? > Yeah, this VM has mostly been used to develop my window function patches, so this is critical for me. Regards, -- Hitoshi Harada
Josh Berkus <josh@agliodbs.com> wrote: > Could this be an issue with VirtualBox? Have you used this VM for > testing before? As I've hit a few bugs in VirtualBox, this is a definite possibility. (So is Tom's suggestion of inconsistent sources.) Because I could, I just installed a new CentOS 5.4 (no updates, 64 bit) VM on VirtualBox 3.1.6 hosted on OS X 10.5.8 (Leopard) with bridged networking. I installed postgresql-9.0alpha5-revised.tar.bz2, compiled it with the default options, ran 'make check' and everything passed: ======================= All 122 tests passed. ======================= I'm not sure that this helps much as it doesn't rule out VirtualBox, but my experience with it running Linux has been positive. I do reboot the host at the first hint of trouble and that fixes many ills; VirtualBox is not what I would consider production ready, but it's competitive and I switched to it from Parallels which I'd paid for. I'd trial VMware but it doesn't claim to support the *BSDs which is a feature I need, and I've seen at least one report where it didn't work well with FreeBSD. That's hearsay, of course, but it makes me cautious, particularly when I've got something that works "well enough". Regards, Giles
On Sun, 2010-04-04 at 12:19 -0400, Tom Lane wrote: > Hitoshi Harada <umi.tanuki@gmail.com> writes: > > I cannot figure out at all what is wrong. Have any idea? > > Since nobody else is reporting this, it seems like it must be either > something messed up about your system, or something wrong with your > copy of the PG sources. The only times I have seen this issue over the last few years were when building PostgreSQL RPMs in Xen-based virtual machines. It is rare, and it happens when there are 10+ parallel builds. ...I'm pretty sure that it is related to the resources available at build time... -- Devrim GÜNDÜZ PostgreSQL Danışmanı/Consultant, Red Hat Certified Engineer PostgreSQL RPM Repository: http://yum.pgrpms.org Community: devrim~PostgreSQL.org, devrim.gunduz~linux.org.tr http://www.gunduz.org Twitter: http://twitter.com/devrimgunduz
2010/4/5 Giles Lean <giles.lean@pobox.com>: > > Josh Berkus <josh@agliodbs.com> wrote: > >> Could this be an issue with VirtualBox? Have you used this VM for >> testing before? > > I'm not sure that this helps much as it doesn't rule out > VirtualBox, but my experience with it running Linux has > been positive. I do reboot the host at the first hint of > trouble and that fixes many ills; VirtualBox is not what > I would consider production ready, but it's competitive > and I switched to it from Parallels which I'd paid for. Just note, I rebooted the guest VM today and retried but things are as before. The host reboot doesn't affect either. I also tried another CentOS5.4 VM on the same VirtualBox and succeeded to build. Another RHEL Server 5.2 (Tikanga) x86_64 real machine can also build alpha5. Seems quite hopeless but it's surely reproducible. Regards, -- Hitoshi Harada
Hitoshi Harada <umi.tanuki@gmail.com> wrote: > Just note, I rebooted the guest VM today and retried but things are as > before. The host reboot doesn't affect either. Bad luck. :-( > I also tried another CentOS5.4 VM on the same VirtualBox and succeeded > to build. Another RHEL Server 5.2 (Tikanga) x86_64 real machine can > also build alpha5. Well, that sounds like a workaround: at least you can continue working, although it doesn't leave a lot of confidence that you won't have to reinstall _another_ virtual machine from time to time. :-( > Seems quite hopeless but it's surely reproducible. They always are, but it can take a while. Back when I used to work in kernel support we had one problem take a year. We were Not Pleased, to put it mildly. (Nor was the main customer seeing it.) We got to the answer in the end, but oh boy, it was ugly on the way, and not our finest moment. (Answer was that someone left 'volatile' off a variable declaration, so it was a race condition. A race condition that came and went as code around the incorrect line of code was changed, so only some patch levels were affected. Plus, like any race condition, customers either tended never to see it or tended to hit it semi-regularly. It was ... difficult. It wasn't me who solved it; fortunately someone else "owned" it and eventually thought to disassemble all the different patch levels. (He was looking for a compiler bug, unsurprisingly; when the problem first occurred he had only one patch version to deal with as well, so that approach didn't suggest itself first off, and everyone (where "everyone" was quite a lot of people) overlooked the missing 'volatile' during code reviews.) Back to VirtualBox: it blew up on me _again_ yesterday, refusing to boot FreeBSD as anything but 32 bit, no matter how I coaxed it and no matter that it had been running a 64bit installation for weeks before that. As booting a FreeBSD 64 bit kernel in 32 bit mode doesn't work (surprise!) 
I have exorcised VirtualBox from my system (like Parallels before it, which was even worse), and am deciding today what real hardware I shall buy to replace VirtualBox. (Before anyone suggests VMware Fusion: they're honest enough to admit to not supporting one of the operating systems I care about, so that's not a solution, unfortunately.) So I can't test with VirtualBox anymore to help out, sorry. Giles