Thread: Server hangs on multiple connections
Hi,

I posted this through the web page but it didn't come over the list, so I am sending it directly. Hope that's okay.

I am able to get PostgreSQL 7.2.2 built and installed, and postmaster starts up fine, but when hit with multiple simultaneous connections (2 or more), the server freezes up and can only be stopped with immediate mode. The existing postgres processes appear hung (see ps output below) and running psql to connect also just hangs.

The problem I'm describing is only happening when I install on the following platform: Yellow Dog Linux 2.3 on PowerPC (PowerMac G4 QuickSilver, dual processor) with 1.5GB RAM.

$ cat /proc/cpuinfo
processor : 0
cpu : 7450, altivec supported
clock : 799MHz
revision : 2.1 (pvr 8000 0201)
bogomips : 797.90

processor : 1
cpu : 7450, altivec supported
clock : 799MHz
revision : 2.1 (pvr 8000 0201)
bogomips : 797.90

total bogomips : 1595.80
machine : PowerMac3,5
motherboard : PowerMac3,5 MacRISC2 MacRISC Power Macintosh
detected as : 69 (PowerMac G4 Silver)
pmac flags : 00000000
L2 cache : 256K unified
memory : 1536MB
pmac-generation : NewWorld

$ uname -a
Linux chef.rdss.com 2.4.19-4asmp #1 SMP Wed Jun 5 00:59:38 EDT 2002 ppc unknown

I have run PostgreSQL since 7.1 successfully on Red Hat Linux i386 and Mac OS X 10.2 ppc (the very box I am currently having problems with) without the lockup problem. I am currently running PostgreSQL 7.2.2 on a Red Hat i386 machine, installed from source, and it's working fine.

This problem can be replicated by building PostgreSQL from source and running the 'make check' sequence. It also happens when I 'make install' and then initiate more than one simultaneous connection to the PostgreSQL server.

The PostgreSQL server log does not show anything unusual, until I kill the postmaster and then it reports on all the backend connections that were terminated.

Here is the sequence of steps I use which results in this condition: $ tar xfz postgresql-7.2.2.tar.gz $ cd postgresql-7.2.2 $ ./configure $ make $ make check It hangs at this step. ps output shows: 5836 pts/0 S 0:00 make check 5919 pts/0 S 0:00 make -C src/test check 5920 pts/0 S 0:00 make -C regress check 5968 pts/0 S 0:00 /bin/sh ./pg_regress --temp-install --top-builddir=../../.. --schedule=./parallel_schedule --multibyte= 7827 pts/0 S 0:00 /home/davidc/src/PostgreSQL/postgresql-7.2.2/src/test/regress/./ tmp_check/install//usr/local/pgsql/bin/po 7830 pts/0 S 0:00 postgres: stats buffer process 7832 pts/0 S 0:00 postgres: stats collector process 7891 pts/0 S 0:00 /bin/sh ./pg_regress --temp-install --top-builddir=../../.. --schedule=./parallel_schedule --multibyte= 7892 pts/0 S 0:00 tee ./regression.out 7897 pts/0 S 0:00 /bin/sh ./pg_regress --temp-install --top-builddir=../../.. --schedule=./parallel_schedule --multibyte= 7898 pts/0 S 0:00 /bin/sh ./pg_regress --temp-install --top-builddir=../../.. --schedule=./parallel_schedule --multibyte= 7899 pts/0 S 0:00 /home/davidc/src/PostgreSQL/postgresql-7.2.2/src/test/regress/./ tmp_check/install//usr/local/pgsql/bin/ps 7900 pts/0 S 0:00 /bin/sh ./pg_regress --temp-install --top-builddir=../../.. --schedule=./parallel_schedule --multibyte= 7901 pts/0 S 0:00 /home/davidc/src/PostgreSQL/postgresql-7.2.2/src/test/regress/./ tmp_check/install//usr/local/pgsql/bin/ps 7902 pts/0 S 0:00 /home/davidc/src/PostgreSQL/postgresql-7.2.2/src/test/regress/./ tmp_check/install//usr/local/pgsql/bin/ps 7903 pts/0 S 0:00 postgres: davidc regression [local] DROP 7904 pts/0 S 0:00 /bin/sh ./pg_regress --temp-install --top-builddir=../../.. --schedule=./parallel_schedule --multibyte= 7905 pts/0 S 0:00 /bin/sh ./pg_regress --temp-install --top-builddir=../../.. --schedule=./parallel_schedule --multibyte= 7906 pts/0 S 0:00 /bin/sh ./pg_regress --temp-install --top-builddir=../../.. 
--schedule=./parallel_schedule --multibyte= 7907 pts/0 S 0:00 /home/davidc/src/PostgreSQL/postgresql-7.2.2/src/test/regress/./ tmp_check/install//usr/local/pgsql/bin/ps 7908 pts/0 S 0:00 postgres: davidc regression [local] SELECT 7909 pts/0 S 0:00 /home/davidc/src/PostgreSQL/postgresql-7.2.2/src/test/regress/./ tmp_check/install//usr/local/pgsql/bin/ps 7910 pts/0 S 0:00 /home/davidc/src/PostgreSQL/postgresql-7.2.2/src/test/regress/./ tmp_check/install//usr/local/pgsql/bin/ps 7911 pts/0 S 0:00 postgres: davidc regression [local] SELECT 7912 pts/0 S 0:00 postgres: davidc regression [local] SELECT 7913 pts/0 S 0:00 postgres: davidc regression [local] SELECT 7914 pts/0 S 0:00 /bin/sh ./pg_regress --temp-install --top-builddir=../../.. --schedule=./parallel_schedule --multibyte= 7915 pts/0 S 0:00 /bin/sh ./pg_regress --temp-install --top-builddir=../../.. --schedule=./parallel_schedule --multibyte= 7916 pts/0 S 0:00 /bin/sh ./pg_regress --temp-install --top-builddir=../../.. --schedule=./parallel_schedule --multibyte= 7917 pts/0 S 0:00 postgres: davidc regression [local] SELECT 7918 pts/0 S 0:00 /home/davidc/src/PostgreSQL/postgresql-7.2.2/src/test/regress/./ tmp_check/install//usr/local/pgsql/bin/ps 7919 pts/0 S 0:00 /home/davidc/src/PostgreSQL/postgresql-7.2.2/src/test/regress/./ tmp_check/install//usr/local/pgsql/bin/ps 7920 pts/0 S 0:00 /home/davidc/src/PostgreSQL/postgresql-7.2.2/src/test/regress/./ tmp_check/install//usr/local/pgsql/bin/ps 7921 pts/0 S 0:00 postgres: davidc regression [local] startup 7922 pts/0 S 0:00 postgres: davidc regression [local] startup 7923 pts/0 S 0:00 postgres: davidc regression [local] startup 7924 pts/0 S 0:00 /bin/sh ./pg_regress --temp-install --top-builddir=../../.. --schedule=./parallel_schedule --multibyte= 7925 pts/0 S 0:00 /bin/sh ./pg_regress --temp-install --top-builddir=../../.. --schedule=./parallel_schedule --multibyte= 7926 pts/0 S 0:00 /bin/sh ./pg_regress --temp-install --top-builddir=../../.. 
--schedule=./parallel_schedule --multibyte= 7927 pts/0 S 0:00 /bin/sh ./pg_regress --temp-install --top-builddir=../../.. --schedule=./parallel_schedule --multibyte= 7928 pts/0 S 0:00 /home/davidc/src/PostgreSQL/postgresql-7.2.2/src/test/regress/./ tmp_check/install//usr/local/pgsql/bin/ps 7929 pts/0 S 0:00 /home/davidc/src/PostgreSQL/postgresql-7.2.2/src/test/regress/./ tmp_check/install//usr/local/pgsql/bin/ps 7930 pts/0 S 0:00 /home/davidc/src/PostgreSQL/postgresql-7.2.2/src/test/regress/./ tmp_check/install//usr/local/pgsql/bin/ps 7931 pts/0 S 0:00 /home/davidc/src/PostgreSQL/postgresql-7.2.2/src/test/regress/./ tmp_check/install//usr/local/pgsql/bin/ps 7932 pts/0 S 0:00 postgres: davidc regression [local] startup 7933 pts/0 S 0:00 postgres: davidc regression [local] startup 7934 pts/0 S 0:00 postgres: davidc regression [local] startup 7935 pts/0 S 0:00 postgres: davidc regression [local] startup Some other possibly useful details: $ gcc --version 2.95.4 $ make --version GNU Make version 3.79.1, by Richard Stallman and Roland McGrath. Built for powerpc-yellowdog-linux-gnu $ autoconf --version Autoconf version 2.13 $ rpm -qa | grep glibc glibc-2.2.5-1.2.3a glibc-devel-2.2.5-1.2.3a glibc-common-2.2.5-1.2.3a I tried to include the output from the installation commands, but was told the message was too long to post to the list. Please let me know if it would help to send any of that separately. PostgreSQL is a fantastic product, and I thoroughly enjoy using it on other platforms; I would love to get it working on this one as well, and I am at a loss as to why it appears to be hanging here. Thank you in advance for your time in considering this submission. David
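
Editorial sketch: the report says the hang appears as soon as two or more connections arrive at once. One way to reproduce that outside of 'make check' might look like the helper below. The function name run_parallel is hypothetical and not part of PostgreSQL; it simply starts N copies of whatever command it is given and waits for them. The psql invocation in the trailing comment assumes a running server and the default template1 database.

```shell
#!/bin/sh
# Hypothetical helper (not from the original report): start N copies of a
# command at the same time and wait for all of them to finish.
# Returns non-zero if any copy exits with a non-zero status.
run_parallel() {
    count=$1
    shift
    i=0
    pids=""
    while [ "$i" -lt "$count" ]; do
        "$@" &                 # launch one copy in the background
        pids="$pids $!"        # remember its PID so we can wait on it
        i=$((i + 1))
    done
    status=0
    for pid in $pids; do
        wait "$pid" || status=1
    done
    return $status
}

# Example (assumes a running server and the default template1 database):
#   run_parallel 2 psql -c 'SELECT 1' template1
```

If the server hangs as described, the psql clients will presumably never return and the wait loop will block, which is itself the symptom being reproduced.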
David Christian <davidc@comtechmobile.com> writes:
> The problem I'm describing is only happening when I install on the
> following platform: Yellow Dog Linux 2.3 on PowerPC (PowerMac G4
> QuickSilver, dual processor) with 1.5GB RAM.

Hmm ... I'm not sure whether anyone's tried it with a dual-processor PPC system before. I wonder if there's some problem with the PPC spinlock code given multiple CPUs?

Could you build with --enable-debug and --enable-cassert (if you didn't already), repeat the 'make check' scenario, and then attach to a few of the stuck backend processes with gdb and get stack traces from them? That would give us a little more info to work with.

> I have run PostgreSQL since 7.1 successfully on Red Hat Linux i386 and
> Mac OS X 10.2 ppc (the very box I am currently having problems with)
> without the lockup problem.

Have you run 7.2.* on this same box under OS X? (Ie, could the problem be specific to YDL?)

regards, tom lane
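
Editorial sketch: attaching gdb to several stuck backends by hand is repetitive, so the request above could be scripted roughly as follows. Both function names (backend_pids, dump_backtraces) are hypothetical. The sketch assumes `ps auxw` column layout (PID in field 2, command starting at field 11) and a gdb that accepts `-batch` and `-x FILE`; whether a given gdb build behaves well when attaching in batch mode is not something the thread confirms.

```shell
#!/bin/sh
# Print the PIDs of PostgreSQL backends found in `ps auxw` output on stdin.
# Field 11 is the start of the command; the stats buffer/collector
# processes are skipped because they are not client backends.
backend_pids() {
    awk '$11 == "postgres:" && $12 != "stats" { print $2 }'
}

# Attach gdb to each PID in turn and print a backtrace.
# $1 is the path to the postgres executable; remaining args are PIDs.
dump_backtraces() {
    postgres_bin=$1
    shift
    cmdfile=$(mktemp) || return 1
    echo bt > "$cmdfile"       # the only gdb command we run is "bt"
    for pid in "$@"; do
        echo "=== backend $pid ==="
        gdb -batch -x "$cmdfile" "$postgres_bin" "$pid"
    done
    rm -f "$cmdfile"
}

# Example (path is illustrative):
#   dump_backtraces /path/to/bin/postgres $(ps auxw | backend_pids)
```

Building with --enable-debug first, as requested, is what makes the resulting frames carry file and line information.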
On Thursday, Sep 19, 2002, at 17:10 US/Eastern, Tom Lane wrote: > Could you build with --enable-debug and --enable-cassert (if you didn't > already), repeat the 'make check' scenario, and then attach to a few of > the stuck backend processes with gdb and get stack traces from them? > That would give us a little more info to work with. Happy to. Interestingly, when I build with --enable-debug and --enable-cassert, the server doesn't lock up during 'make check', it just (very quickly) fails all of the tests and exits. I tried several times. $ ./configure --enable-debug --enable-cassert $ make $ make check Here is the tail end of the 'make check' output in that case: /bin/sh ./pg_regress --temp-install --top-builddir=../../.. --schedule=./parallel_schedule --multibyte= ============== creating temporary installation ============== ============== initializing database system ============== ============== starting postmaster ============== running on port 65432 with pid 7893 ============== creating database "regression" ============== CREATE DATABASE ============== dropping regression test user accounts ============== ============== installing PL/pgSQL ============== ============== running regression test queries ============== parallel group (13 tests): boolean int4 varchar char name text int2 int8 oid float4 bit numeric float8 boolean ... FAILED char ... FAILED name ... FAILED varchar ... FAILED text ... FAILED int2 ... FAILED int4 ... FAILED int8 ... FAILED oid ... FAILED float4 ... FAILED float8 ... FAILED bit ... FAILED numeric ... FAILED test strings ... FAILED test numerology ... FAILED parallel group (20 tests): point box lseg path circle polygon time date timetz timestamp timestamptz interval abstime tinterval reltime inet comments oidjoins type_sanity opr_sanity point ... FAILED lseg ... FAILED box ... FAILED path ... FAILED polygon ... FAILED circle ... FAILED date ... FAILED time ... FAILED timetz ... FAILED timestamp ... FAILED timestamptz ... 
FAILED interval ... FAILED abstime ... FAILED reltime ... FAILED tinterval ... FAILED inet ... FAILED comments ... FAILED oidjoins ... FAILED type_sanity ... FAILED opr_sanity ... FAILED test geometry ... FAILED test horology ... FAILED test create_function_1 ... FAILED test create_type ... FAILED test create_table ... FAILED test create_function_2 ... FAILED test copy ... FAILED parallel group (7 tests): constraints triggers create_misc create_operator create_aggregate create_index inherit constraints ... FAILED triggers ... FAILED create_misc ... FAILED create_aggregate ... FAILED create_operator ... FAILED create_index ... FAILED inherit ... FAILED test create_view ... FAILED test sanity_check ... FAILED test errors ... FAILED test select ... FAILED parallel group (16 tests): select_distinct select_into select_distinct_on select_implicit select_having subselect case union join aggregates transactions portals arrays random btree_index hash_index select_into ... FAILED select_distinct ... FAILED select_distinct_on ... FAILED select_implicit ... FAILED select_having ... FAILED subselect ... FAILED union ... FAILED case ... FAILED join ... FAILED aggregates ... FAILED transactions ... FAILED random ... failed (ignored) portals ... FAILED arrays ... FAILED btree_index ... FAILED hash_index ... FAILED test privileges ... ok test misc ... FAILED parallel group (5 tests): alter_table select_views portals_p2 rules foreign_key select_views ... FAILED alter_table ... FAILED portals_p2 ... FAILED rules ... FAILED foreign_key ... FAILED parallel group (3 tests): plpgsql limit temp limit ... FAILED plpgsql ... FAILED temp ... FAILED ============== shutting down postmaster ============== ===================================================== 78 of 79 tests failed, 1 of these failures ignored. ===================================================== The differences that caused some tests to fail can be viewed in the file `./regression.diffs'. 
A copy of the test summary that you see above is saved in the file `./regression.out'. make[2]: *** [check] Error 1 rm regress.o make[2]: Leaving directory `/home/davidc/src/PostgreSQL/postgresql-7.2.2/src/test/regress' make[1]: *** [check] Error 2 make[1]: Leaving directory `/home/davidc/src/PostgreSQL/postgresql-7.2.2/src/test' make: *** [check] Error 2 I then tried with just ./configure --enable-debug alone. And it did hang in the place I described in my first message. (Between builds, I rm -rf'd the installation postgresql-7.2.2 directory to be sure I was using fully clean source each time.) $ ps auxw | grep 'postgres:' davidc 15639 0.0 0.0 7212 1380 pts/0 S 21:44 0:00 postgres: stats buffer process davidc 15641 0.0 0.0 6268 1428 pts/0 S 21:44 0:00 postgres: stats collector process davidc 15712 0.0 0.2 6664 3176 pts/0 S 21:44 0:00 postgres: davidc regression [local] idle davidc 15715 0.0 0.1 6660 3040 pts/0 S 21:44 0:00 postgres: davidc regression [local] SELECT davidc 15716 0.0 0.1 6660 3044 pts/0 S 21:44 0:00 postgres: davidc regression [local] SELECT davidc 15717 0.0 0.1 6660 2944 pts/0 S 21:44 0:00 postgres: davidc regression [local] idle davidc 15722 0.0 0.1 6660 2864 pts/0 S 21:44 0:00 postgres: davidc regression [local] SELECT davidc 15731 0.0 0.1 6572 2140 pts/0 S 21:44 0:00 postgres: davidc regression [local] startup davidc 15732 0.0 0.1 6568 1944 pts/0 S 21:44 0:00 postgres: davidc regression [local] startup davidc 15733 0.0 0.1 6620 2524 pts/0 S 21:44 0:00 postgres: davidc regression [local] SELECT davidc 15737 0.0 0.1 6568 1980 pts/0 S 21:44 0:00 postgres: davidc regression [local] startup davidc 15738 0.0 0.1 6660 2844 pts/0 S 21:44 0:00 postgres: davidc regression [local] CREATE davidc 15742 0.0 0.1 6568 1940 pts/0 S 21:44 0:00 postgres: davidc regression [local] startup davidc 15743 0.0 0.1 6568 1876 pts/0 S 21:44 0:00 postgres: davidc regression [local] startup davidc 15744 0.0 0.1 6548 1696 pts/0 S 21:44 0:00 postgres: davidc regression [local] 
startup I don't really know what I'm doing with gdb, but I scanned the man page, and here's what I typed: $ gdb src/test/regress/tmp_check/install/usr/local/pgsql/bin/postgres 15715 GNU gdb Yellow Dog Linux (5.1.1-1b) Copyright 2002 Free Software Foundation, Inc. GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Type "show copying" to see the conditions. There is absolutely no warranty for GDB. Type "show warranty" for details. This GDB was configured as "ppc-yellowdog-linux"... /home/davidc/src/PostgreSQL/postgresql-7.2.2/15715: No such file or directory. Attaching to program: /home/davidc/src/PostgreSQL/postgresql-7.2.2/src/test/regress/ tmp_check/install/usr/local/pgsql/bin/postgres, process 15715 Reading symbols from /usr/lib/libz.so.1...done. Loaded symbols for /usr/lib/libz.so.1 Reading symbols from /lib/libcrypt.so.1...done. Loaded symbols for /lib/libcrypt.so.1 Reading symbols from /lib/libresolv.so.2...done. Loaded symbols for /lib/libresolv.so.2 Reading symbols from /lib/libnsl.so.1...done. Loaded symbols for /lib/libnsl.so.1 Reading symbols from /lib/libdl.so.2...done. Loaded symbols for /lib/libdl.so.2 Reading symbols from /lib/libm.so.6...done. Loaded symbols for /lib/libm.so.6 Reading symbols from /usr/lib/libhistory.so.4...done. Loaded symbols for /usr/lib/libhistory.so.4 Reading symbols from /lib/libc.so.6...done. Loaded symbols for /lib/libc.so.6 Reading symbols from /lib/ld.so.1...done. Loaded symbols for /lib/ld.so.1 0x0fdc297c in __syscall_ipc () at soinit.c:76 76 soinit.c: No such file or directory. 
in soinit.c (gdb) bt #0 0x0fdc297c in __syscall_ipc () at soinit.c:76 #1 0x0fdc38c0 in semop (semid=4, sops=0x7fffea18, nsops=1) at ../sysdeps/unix/sysv/linux/semop.c:36 #2 0x100e4424 in IpcSemaphoreLock () #3 0x100eb018 in LWLockAcquire () #4 0x100e7f3c in LockAcquire () #5 0x100e7434 in LockRelation () #6 0x1002cc5c in relation_openr () #7 0x1002cdac in heap_openr () #8 0x100dc100 in fireRIRrules () #9 0x100dc878 in QueryRewrite () #10 0x100eeee4 in pg_analyze_and_rewrite () #11 0x100ef244 in pg_exec_query_string () #12 0x100f0688 in PostgresMain () #13 0x100cf3dc in DoBackend () #14 0x100cec54 in BackendStartup () #15 0x100cdaac in ServerLoop () #16 0x100cd564 in PostmasterMain () #17 0x100a26b8 in main () #18 0x0fd07f70 in __libc_start_main (argc=4, ubp_av=0x7ffff814, ubp_ev=0x1, auxvec=0x7ffff8a8, rtld_fini=0x4, stinfo=0x10154c20, stack_on_entry=0x1) at ../sysdeps/powerpc/elf/libc-start.c:119 (gdb) q The program is running. Quit anyway (and detach it)? (y or n) y Detaching from program: /home/davidc/src/PostgreSQL/postgresql-7.2.2/src/test/regress/ tmp_check/install/usr/local/pgsql/bin/postgres, process 15715 $ gdb src/test/regress/tmp_check/install/usr/local/pgsql/bin/postgres 15738 GNU gdb Yellow Dog Linux (5.1.1-1b) Copyright 2002 Free Software Foundation, Inc. GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Type "show copying" to see the conditions. There is absolutely no warranty for GDB. Type "show warranty" for details. This GDB was configured as "ppc-yellowdog-linux"... /home/davidc/src/PostgreSQL/postgresql-7.2.2/15738: No such file or directory. Attaching to program: /home/davidc/src/PostgreSQL/postgresql-7.2.2/src/test/regress/ tmp_check/install/usr/local/pgsql/bin/postgres, process 15738 Reading symbols from /usr/lib/libz.so.1...done. Loaded symbols for /usr/lib/libz.so.1 Reading symbols from /lib/libcrypt.so.1...done. 
Loaded symbols for /lib/libcrypt.so.1 Reading symbols from /lib/libresolv.so.2...done. Loaded symbols for /lib/libresolv.so.2 Reading symbols from /lib/libnsl.so.1...done. Loaded symbols for /lib/libnsl.so.1 Reading symbols from /lib/libdl.so.2...done. Loaded symbols for /lib/libdl.so.2 Reading symbols from /lib/libm.so.6...done. Loaded symbols for /lib/libm.so.6 Reading symbols from /usr/lib/libhistory.so.4...done. Loaded symbols for /usr/lib/libhistory.so.4 Reading symbols from /lib/libc.so.6...done. Loaded symbols for /lib/libc.so.6 Reading symbols from /lib/ld.so.1...done. Loaded symbols for /lib/ld.so.1 0x0fdc297c in __syscall_ipc () at soinit.c:76 76 soinit.c: No such file or directory. in soinit.c (gdb) bt #0 0x0fdc297c in __syscall_ipc () at soinit.c:76 #1 0x0fdc38c0 in semop (semid=4, sops=0x7fffe7a8, nsops=1) at ../sysdeps/unix/sysv/linux/semop.c:36 #2 0x100e4424 in IpcSemaphoreLock () #3 0x100eb018 in LWLockAcquire () #4 0x100e7f3c in LockAcquire () #5 0x100e7434 in LockRelation () #6 0x1003206c in index_beginscan () #7 0x1013d280 in SearchCatCache () #8 0x101420c8 in SearchSysCache () #9 0x100502c0 in CatalogIndexInsert () #10 0x1004baf0 in AddNewRelationTuple () #11 0x1004bd10 in heap_create_with_catalog () #12 0x1006c86c in DefineRelation () #13 0x100f1828 in ProcessUtility () #14 0x100ef2f8 in pg_exec_query_string () #15 0x100f0688 in PostgresMain () #16 0x100cf3dc in DoBackend () #17 0x100cec54 in BackendStartup () #18 0x100cdaac in ServerLoop () #19 0x100cd564 in PostmasterMain () #20 0x100a26b8 in main () #21 0x0fd07f70 in __libc_start_main (argc=4, ubp_av=0x7ffff814, ubp_ev=0x1, auxvec=0x7ffff8a8, rtld_fini=0x4, stinfo=0x10154c20, stack_on_entry=0x1) ---Type <return> to continue, or q <return> to quit--- at ../sysdeps/powerpc/elf/libc-start.c:119 (gdb) q The program is running. Quit anyway (and detach it)? 
(y or n) y Detaching from program: /home/davidc/src/PostgreSQL/postgresql-7.2.2/src/test/regress/ tmp_check/install/usr/local/pgsql/bin/postgres, process 15738 $ gdb src/test/regress/tmp_check/install/usr/local/pgsql/bin/postgres 15744 GNU gdb Yellow Dog Linux (5.1.1-1b) Copyright 2002 Free Software Foundation, Inc. GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Type "show copying" to see the conditions. There is absolutely no warranty for GDB. Type "show warranty" for details. This GDB was configured as "ppc-yellowdog-linux"... /home/davidc/src/PostgreSQL/postgresql-7.2.2/15744: No such file or directory. Attaching to program: /home/davidc/src/PostgreSQL/postgresql-7.2.2/src/test/regress/ tmp_check/install/usr/local/pgsql/bin/postgres, process 15744 Reading symbols from /usr/lib/libz.so.1...done. Loaded symbols for /usr/lib/libz.so.1 Reading symbols from /lib/libcrypt.so.1...done. Loaded symbols for /lib/libcrypt.so.1 Reading symbols from /lib/libresolv.so.2...done. Loaded symbols for /lib/libresolv.so.2 Reading symbols from /lib/libnsl.so.1...done. Loaded symbols for /lib/libnsl.so.1 Reading symbols from /lib/libdl.so.2...done. Loaded symbols for /lib/libdl.so.2 Reading symbols from /lib/libm.so.6...done. Loaded symbols for /lib/libm.so.6 Reading symbols from /usr/lib/libhistory.so.4...done. Loaded symbols for /usr/lib/libhistory.so.4 Reading symbols from /lib/libc.so.6...done. Loaded symbols for /lib/libc.so.6 Reading symbols from /lib/ld.so.1...done. Loaded symbols for /lib/ld.so.1 0x0fdc297c in __syscall_ipc () at soinit.c:76 76 soinit.c: No such file or directory. 
in soinit.c
(gdb) bt
#0 0x0fdc297c in __syscall_ipc () at soinit.c:76
#1 0x0fdc38c0 in semop (semid=4, sops=0x7fffe738, nsops=1) at ../sysdeps/unix/sysv/linux/semop.c:36
#2 0x100e4424 in IpcSemaphoreLock ()
#3 0x100eb018 in LWLockAcquire ()
#4 0x100e7f3c in LockAcquire ()
#5 0x100e79b4 in XactLockTableInsert ()
#6 0x10040828 in StartTransaction ()
#7 0x10040bb0 in StartTransactionCommand ()
#8 0x1014a610 in InitPostgres ()
#9 0x100f036c in PostgresMain ()
#10 0x100cf3dc in DoBackend ()
#11 0x100cec54 in BackendStartup ()
#12 0x100cdaac in ServerLoop ()
#13 0x100cd564 in PostmasterMain ()
#14 0x100a26b8 in main ()
#15 0x0fd07f70 in __libc_start_main (argc=4, ubp_av=0x7ffff814, ubp_ev=0x1, auxvec=0x7ffff8a8, rtld_fini=0x4, stinfo=0x10154c20, stack_on_entry=0x1) at ../sysdeps/powerpc/elf/libc-start.c:119
(gdb) q
The program is running. Quit anyway (and detach it)? (y or n) y
Detaching from program: /home/davidc/src/PostgreSQL/postgresql-7.2.2/src/test/regress/tmp_check/install/usr/local/pgsql/bin/postgres, process 15744

If I goofed on this, I'm afraid I will need to ask for some hand-holding with using gdb properly. I'm happy to go through the steps to get you what you need to see.

>> I have run PostgreSQL since 7.1 successfully on Red Hat Linux i386 and
>> Mac OS X 10.2 ppc (the very box I am currently having problems with)
>> without the lockup problem.
>
> Have you run 7.2.* on this same box under OS X? (Ie, could the problem
> be specific to YDL?)

Yes, I have, and I can hammer it all I want without it hanging. Interestingly, I tried Yellow Dog's RPM also (7.2) and it exhibits the same behavior (i.e., locking up on multiple connections).

Thanks,
David
David Christian <davidc@comtechmobile.com> writes:
> Happy to. Interestingly, when I build with --enable-debug and
> --enable-cassert, the server doesn't lock up during 'make check', it
> just (very quickly) fails all of the tests and exits. I tried several
> times.

Oh, that's interesting; that says that an Assert() check is failing. We should investigate that first.

There should be a core file left in the database subdirectory after the assert failure --- would you gdb it and get a stack trace from it? Also, you will probably find some useful messages in the postmaster log (which should be left in the log/ subdirectory of the regress tests).

> (gdb) bt
> #0 0x0fdc297c in __syscall_ipc () at soinit.c:76
> #1 0x0fdc38c0 in semop (semid=4, sops=0x7fffea18, nsops=1) at
> ../sysdeps/unix/sysv/linux/semop.c:36
> #2 0x100e4424 in IpcSemaphoreLock ()
> #3 0x100eb018 in LWLockAcquire ()
> #4 0x100e7f3c in LockAcquire ()
> #5 0x100e7434 in LockRelation ()

Sure enough, it would seem that everyone's stuck waiting for a lock. But let's chase the Assert first; that might identify the problem.

regards, tom lane
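
Editorial sketch: when looking for the core file mentioned above, two things are worth checking. First, a process only writes a core file if the shell's core size limit allows it, and a limit of 0 is common; whether that applies on this machine is an assumption, not something the thread establishes. Second, a search by name should be restricted to regular files so that directories with "core" in their name are not reported. The helper name find_core_files is hypothetical.

```shell
#!/bin/sh
# List regular files named "core" or "core.<something>" below a directory.
# -type f avoids matching directories that merely have "core" in their name.
# Note: an unrelated file such as "core.h" would also match the second pattern.
find_core_files() {
    find "$1" -type f \( -name core -o -name 'core.*' \)
}

# Before re-running the failing test, lift the core size limit in the same
# shell that starts the build (assumption: the limit is currently 0):
#   ulimit -c unlimited
#   make check
#   find_core_files src/test/regress/tmp_check
```

Any file found this way can be opened with gdb by passing it after the executable path, in place of a process ID.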
On Thursday, Sep 19, 2002, at 18:33 US/Eastern, Tom Lane wrote:
> David Christian <davidc@comtechmobile.com> writes:
>> Happy to. Interestingly, when I build with --enable-debug and
>> --enable-cassert, the server doesn't lock up during 'make check', it
>> just (very quickly) fails all of the tests and exits. I tried several
>> times.
>
> Oh, that's interesting; that says that an Assert() check is failing.
> We should investigate that first.
>
> There should be a core file left in the database subdirectory after
> the assert failure --- would you gdb it and get a stack trace from it?
> Also, you will probably find some useful messages in the postmaster
> log (which should be left in the log/ subdirectory of the regress
> tests)

Unfortunately, I see no core file under the source tree after the assert failure. The postmaster.log does show entries for failed assertions. It is 246 lines long, and I am pasting it to the bottom of this message.

>> (gdb) bt
>> #0 0x0fdc297c in __syscall_ipc () at soinit.c:76
>> #1 0x0fdc38c0 in semop (semid=4, sops=0x7fffea18, nsops=1) at
>> ../sysdeps/unix/sysv/linux/semop.c:36
>> #2 0x100e4424 in IpcSemaphoreLock ()
>> #3 0x100eb018 in LWLockAcquire ()
>> #4 0x100e7f3c in LockAcquire ()
>> #5 0x100e7434 in LockRelation ()
>
> Sure enough, it would seem that everyone's stuck waiting for a lock.
> But let's chase the Assert first; that might identify the problem.

Okay, hope this helps. I really appreciate the time you are taking to look at this.

David

[davidc@chef ~/src/PostgreSQL/postgresql-7.2.2]$ find . 
-name '*core*' ./contrib/retep/uk/org/retep/xml/core ./src/interfaces/jdbc/org/postgresql/core [davidc@chef ~/src/PostgreSQL/postgresql-7.2.2/src/test/regress/log]$ cat postmaster.log DEBUG: database system was shut down at 2002-09-20 02:46:51 GMT DEBUG: checkpoint record is at 0/113640 DEBUG: redo record is at 0/113640; undo record is at 0/0; shutdown TRUE DEBUG: next transaction id: 89; next oid: 16556 DEBUG: database system is ready ERROR: DROP GROUP: group "regressgroup1" does not exist TRAP: Failed Assertion("!(lock->shared > 0):", File: "lwlock.c", Line: 434) !(lock->shared > 0) (0) [Success] DEBUG: server process (pid 22628) was terminated by signal 6 DEBUG: terminating any other active server processes NOTICE: Message from PostgreSQL backend: The Postmaster has informed me that some other backend died abnormally and possibly corrupted shared memory. I have rolled back the current transaction and am going to terminate your database system connection and exit. Please reconnect to the database system and repeat your query. NOTICE: Message from PostgreSQL backend: The Postmaster has informed me that some other backend died abnormally and possibly corrupted shared memory. I have rolled back the current transaction and am going to terminate your database system connection and exit. Please reconnect to the database system and repeat your query. NOTICE: Message from PostgreSQL backend: The Postmaster has informed me that some other backend died abnormally and possibly corrupted shared memory. I have rolled back the current transaction and am going to terminate your database system connection and exit. Please reconnect to the database system and repeat your query. FATAL 1: The database system is in recovery mode NOTICE: Message from PostgreSQL backend: The Postmaster has informed me that some other backend died abnormally and possibly corrupted shared memory. 
I have rolled back the current transaction and am going to terminate your database system connection and exit. Please reconnect to the database system and repeat your query. NOTICE: Message from PostgreSQL backend: The Postmaster has informed me that some other backend died abnormally and possibly corrupted shared memory. I have rolled back the current transaction and am going to terminate your database system connection and exit. Please reconnect to the database system and repeat your query. NOTICE: Message from PostgreSQL backend: The Postmaster has informed me that some other backend died abnormally and possibly corrupted shared memory. I have rolled back the current transaction and am going to terminate your database system connection and exit. Please reconnect to the database system and repeat your query. DEBUG: all server processes terminated; reinitializing shared memory and semaphores DEBUG: database system was interrupted at 2002-09-20 02:46:51 GMT DEBUG: checkpoint record is at 0/113640 DEBUG: redo record is at 0/113640; undo record is at 0/0; shutdown TRUE DEBUG: next transaction id: 89; next oid: 16556 DEBUG: database system was not properly shut down; automatic recovery in progress DEBUG: redo starts at 0/113680 DEBUG: ReadRecord: record with zero length at 0/138818 DEBUG: redo done at 0/1387F0 FATAL 1: The database system is starting up FATAL 1: The database system is starting up FATAL 1: The database system is starting up FATAL 1: The database system is starting up FATAL 1: The database system is starting up FATAL 1: The database system is starting up FATAL 1: The database system is starting up FATAL 1: The database system is starting up FATAL 1: The database system is starting up FATAL 1: The database system is starting up FATAL 1: The database system is starting up FATAL 1: The database system is starting up FATAL 1: The database system is starting up FATAL 1: The database system is starting up FATAL 1: The database system is starting up FATAL 1: 
The database system is starting up FATAL 1: The database system is starting up FATAL 1: The database system is starting up FATAL 1: The database system is starting up FATAL 1: The database system is starting up FATAL 1: The database system is starting up FATAL 1: The database system is starting up FATAL 1: The database system is starting up FATAL 1: The database system is starting up FATAL 1: The database system is starting up FATAL 1: The database system is starting up FATAL 1: The database system is starting up FATAL 1: The database system is starting up FATAL 1: The database system is starting up FATAL 1: The database system is starting up FATAL 1: The database system is starting up FATAL 1: The database system is starting up FATAL 1: The database system is starting up FATAL 1: The database system is starting up FATAL 1: The database system is starting up FATAL 1: The database system is starting up FATAL 1: The database system is starting up FATAL 1: The database system is starting up FATAL 1: The database system is starting up FATAL 1: The database system is starting up FATAL 1: The database system is starting up FATAL 1: The database system is starting up FATAL 1: The database system is starting up FATAL 1: The database system is starting up FATAL 1: The database system is starting up FATAL 1: The database system is starting up FATAL 1: The database system is starting up FATAL 1: The database system is starting up FATAL 1: The database system is starting up FATAL 1: The database system is starting up FATAL 1: The database system is starting up FATAL 1: The database system is starting up FATAL 1: The database system is starting up FATAL 1: The database system is starting up FATAL 1: The database system is starting up FATAL 1: The database system is starting up FATAL 1: The database system is starting up FATAL 1: The database system is starting up FATAL 1: The database system is starting up FATAL 1: The database system is starting up FATAL 1: The database system 
is starting up
DEBUG: database system is ready
ERROR: CREATE USER: user name "regressuser4" already exists
NOTICE: ALTER GROUP: user "regressuser2" is already in group "regressgroup2"
ERROR: atest2: Permission denied.
ERROR: atest2: Permission denied.
ERROR: atest2: Permission denied.
ERROR: atest2: Permission denied.
ERROR: LOCK TABLE: permission denied
ERROR: atest2: Permission denied.
ERROR: permission denied
ERROR: atest2: Permission denied.
ERROR: atest1: Permission denied.
ERROR: atest2: Permission denied.
ERROR: atest1: Permission denied.
ERROR: atest1: Permission denied.
ERROR: atest2: Permission denied.
ERROR: atest1: Permission denied.
ERROR: atest2: Permission denied.
ERROR: atest2: Permission denied.
ERROR: atest2: Permission denied.
ERROR: atest2: Permission denied.
ERROR: atest2: Permission denied.
ERROR: atest3: Permission denied.
ERROR: has_table_privilege: relation "pg_shad" does not exist
ERROR: user "nosuchuser" does not exist
ERROR: has_table_privilege: invalid privilege type sel
ERROR: pg_aclcheck: invalid user id 4293967297
ERROR: has_table_privilege: invalid relation oid 1
ERROR: Relation "onek" does not exist
ERROR: Relation "onek" does not exist
ERROR: Relation "tmp" does not exist
ERROR: Relation "tmp" does not exist
ERROR: table "tmp" does not exist
ERROR: Relation "onek" does not exist
ERROR: Relation "onek" does not exist
ERROR: Relation "onek" does not exist
ERROR: Relation "onek" does not exist
ERROR: Relation "onek2" does not exist
ERROR: Relation "onek2" does not exist
ERROR: Relation "onek2" does not exist
ERROR: Relation "stud_emp" does not exist
ERROR: Relation "stud_emp" does not exist
ERROR: Relation "stud_emp" does not exist
ERROR: Relation "stud_emp" does not exist
ERROR: Relation "a_star" does not exist
ERROR: Relation "b_star" does not exist
ERROR: Relation "c_star" does not exist
ERROR: Relation "d_star" does not exist
ERROR: Relation "e_star" does not exist
ERROR: Relation "f_star" does not exist
ERROR: Relation "a_star" does not exist
ERROR: Relation "a_star" does not exist
ERROR: Relation "f_star" does not exist
ERROR: Relation "e_star" does not exist
ERROR: Relation "d_star" does not exist
ERROR: Relation "c_star" does not exist
ERROR: Relation "b_star" does not exist
ERROR: Relation "a_star" does not exist
ERROR: Relation "a_star" does not exist
ERROR: Relation "a_star" does not exist
ERROR: Relation "a_star" does not exist
ERROR: Relation "a_star" does not exist
ERROR: Relation "a_star" does not exist
ERROR: Relation "f_star" does not exist
ERROR: Relation "f_star" does not exist
ERROR: Relation "e_star" does not exist
ERROR: Relation "e_star" does not exist
ERROR: Relation "a_star" does not exist
ERROR: Relation "a_star" does not exist
ERROR: Relation "person" does not exist
ERROR: Relation "person" does not exist
ERROR: Relation "hobbies_r" does not exist
ERROR: Relation "hobbies_r" does not exist
ERROR: Relation "person" does not exist
ERROR: Relation "person" does not exist
ERROR: Relation "person" does not exist
ERROR: Relation "person" does not exist
ERROR: Relation "person" does not exist
ERROR: Relation "person" does not exist
ERROR: Function 'user_relns()' does not exist
        Unable to identify a function that satisfies the given argument types
        You may need to add explicit typecasts
ERROR: Function 'hobbies_by_name(unknown)' does not exist
        Unable to identify a function that satisfies the given argument types
        You may need to add explicit typecasts
ERROR: Function 'oldstyle_length(int4, text)' does not exist
        Unable to identify a function that satisfies the given argument types
        You may need to add explicit typecasts
ERROR: Relation "street" does not exist
ERROR: Relation "iexit" does not exist
ERROR: Relation "toyemp" does not exist
TRAP: Failed Assertion("!(lock->shared > 0):", File: "lwlock.c", Line: 434)
!(lock->shared > 0) (0) [Interrupted system call]
DEBUG: server process (pid 23536) was terminated by signal 6
DEBUG: terminating any other active server processes
NOTICE: Message from PostgreSQL backend:
        The Postmaster has informed me that some other backend
        died abnormally and possibly corrupted shared memory.
        I have rolled back the current transaction and am
        going to terminate your database system connection and exit.
        Please reconnect to the database system and repeat your query.
NOTICE: Message from PostgreSQL backend:
        The Postmaster has informed me that some other backend
        died abnormally and possibly corrupted shared memory.
        I have rolled back the current transaction and am
        going to terminate your database system connection and exit.
        Please reconnect to the database system and repeat your query.
NOTICE: Message from PostgreSQL backend:
        The Postmaster has informed me that some other backend
        died abnormally and possibly corrupted shared memory.
        I have rolled back the current transaction and am
        going to terminate your database system connection and exit.
        Please reconnect to the database system and repeat your query.
DEBUG: all server processes terminated; reinitializing shared memory and semaphores
DEBUG: database system was interrupted at 2002-09-20 02:46:54 GMT
DEBUG: checkpoint record is at 0/138818
DEBUG: redo record is at 0/138818; undo record is at 0/0; shutdown TRUE
DEBUG: next transaction id: 140; next oid: 24748
DEBUG: database system was not properly shut down; automatic recovery in progress
DEBUG: redo starts at 0/138858
DEBUG: ReadRecord: record with zero length at 0/16D820
DEBUG: redo done at 0/16D7F8
FATAL 1: The database system is starting up
FATAL 1: The database system is starting up
FATAL 1: The database system is starting up
DEBUG: smart shutdown request
DEBUG: database system is ready
DEBUG: shutting down
DEBUG: database system is shut down
David Christian <davidc@comtechmobile.com> writes:
> On Thursday, Sep 19, 2002, at 18:33 US/Eastern, Tom Lane wrote:
>> There should be a core file left in the database subdirectory after
>> the assert failure --- would you gdb it and get a stack trace from it?

> Unfortunately, I see no core file under the source tree after the
> assert failure.

If you are using "make check" then look for
        src/test/regress/tmp_check/data/base/*/core
If you don't see one then you must be running with a ulimit setting that
forbids core dumps --- try "ulimit -c unlimited" before starting the
postmaster.

> TRAP: Failed Assertion("!(lock->shared > 0):", File: "lwlock.c", Line:
> 434)

This confirms my suspicion that something is busted in lock handling on
your machine, but there's not enough info here to tell just what. We
still need a stack trace.

Another interesting line of attack would be to try compiling
src/backend/storage/lmgr/lwlock.c at different optimization levels,
to see if the problem goes away with less optimization. We saw a
problem on AIX (if memory serves) before 7.2 release that turned out
to be due to overaggressive optimization by the compiler. We thought
we'd added enough "volatile" keywords to lwlock.c to discourage any
code rearrangement, but maybe we still need more.

regards, tom lane
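The "ulimit -c unlimited" advice above has a programmatic counterpart, which can help when the process is not started from a shell you control. What follows is only a minimal sketch, assuming a POSIX system with getrlimit()/setrlimit(); the helper names raise_core_limit and core_dumps_allowed are hypothetical and do not exist in PostgreSQL.

```c
#include <sys/resource.h>

/*
 * Hypothetical helper (not part of PostgreSQL): raise the soft core-file
 * size limit to the hard limit.  This is as close as an unprivileged
 * process can get to "ulimit -c unlimited".
 * Returns 0 on success, -1 on failure.
 */
int
raise_core_limit(void)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_CORE, &lim) != 0)
        return -1;

    lim.rlim_cur = lim.rlim_max;    /* soft limit may rise up to the hard limit */

    return setrlimit(RLIMIT_CORE, &lim);
}

/*
 * Hypothetical helper: returns 1 if the current soft limit permits core
 * files, 0 if it forbids them, -1 if the limit could not be read.
 */
int
core_dumps_allowed(void)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_CORE, &lim) != 0)
        return -1;

    return lim.rlim_cur != 0;
}
```

If the hard limit is itself zero, raise_core_limit() succeeds but core_dumps_allowed() still reports 0; in that case the limit has to be lifted by whoever launched the process.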
On Friday, Sep 20, 2002, at 11:30 US/Eastern, Tom Lane wrote:
> If you are using "make check" then look for
> src/test/regress/tmp_check/data/base/*/core

Thanks. Here it is:

$ gdb src/test/regress/tmp_check/install/usr/local/pgsql/bin/postmaster src/test/regress/tmp_check/data/base/16556/core
GNU gdb Yellow Dog Linux (5.1.1-1b)
Copyright 2002 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "ppc-yellowdog-linux"...
Core was generated by `postgres: davidc regression [local] startup '.
Program terminated with signal 6, Aborted.
Reading symbols from /usr/lib/libz.so.1...done.
Loaded symbols for /usr/lib/libz.so.1
Reading symbols from /lib/libcrypt.so.1...done.
Loaded symbols for /lib/libcrypt.so.1
Reading symbols from /lib/libresolv.so.2...done.
Loaded symbols for /lib/libresolv.so.2
Reading symbols from /lib/libnsl.so.1...done.
Loaded symbols for /lib/libnsl.so.1
Reading symbols from /lib/libdl.so.2...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/libm.so.6...done.
Loaded symbols for /lib/libm.so.6
Reading symbols from /usr/lib/libhistory.so.4...done.
Loaded symbols for /usr/lib/libhistory.so.4
Reading symbols from /lib/libc.so.6...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/ld.so.1...done.
Loaded symbols for /lib/ld.so.1
#0 0x0fd1be44 in kill () at soinit.c:76
76 soinit.c: No such file or directory.
in soinit.c
(gdb) bt
#0 0x0fd1be44 in kill () at soinit.c:76
#1 0x0fd1bcc0 in raise (sig=6) at ../sysdeps/posix/raise.c:27
#2 0x0fd1d374 in abort () at ../sysdeps/generic/abort.c:88
#3 0x1016882c in vararg_format (fmt=0x0) at excabort.c:27
#4 0x10168744 in ExcUnCaught (excP=0x101f0968, detail=0, data=0x0, message=0x101c16b4 "!(lock->shared > 0)") at exc.c:168
#5 0x101687d4 in ExcRaise (excP=0x101f0968, detail=0, data=0x0, message=0x101c16b4 "!(lock->shared > 0)") at exc.c:185
#6 0x10167a4c in ExceptionalCondition (conditionName=0x101c16b4 "!(lock->shared > 0)", exceptionP=0x101f0968, detail=0x0, fileName=0x8 <Address 0x8 out of bounds>, lineNumber=37) at assert.c:70
#7 0x1010ca84 in LWLockRelease (lockid=BufMgrLock) at lwlock.c:434
#8 0x10108a84 in LockAcquire (lockmethod=1, locktag=0x300eb548, xid=806273048, lockmode=1, dontWait=0 '\000') at lock.c:723
#9 0x10107640 in LockRelation (relation=0x102507b8, lockmode=1) at lmgr.c:153
#10 0x100333b0 in relation_openr (relationName=0x101d47a0 "pg_class", lockmode=1) at heapam.c:524
#11 0x1003355c in heap_openr (relationName=0x0, lockmode=6) at heapam.c:595
#12 0x101613a0 in scan_pg_rel_ind (buildinfo={infotype = 2, i = {info_id = 270195424, info_name = 0x101adae0 "pg_trigger_tgrelid_index"}}) at relcache.c:356
#13 0x10161254 in ScanPgRelation (buildinfo={infotype = 2, i = {info_id = 270195424, info_name = 0x101adae0 "pg_trigger_tgrelid_index"}}) at relcache.c:284
#14 0x10162854 in RelationBuildDesc (buildinfo={infotype = 2, i = {info_id = 270195424, info_name = 0x101adae0 "pg_trigger_tgrelid_index"}}, oldrelation=0x0) at relcache.c:968
#15 0x1016335c in RelationNameGetRelation (relationName=0x101adae0 "pg_trigger_tgrelid_index") at relcache.c:1493
#16 0x10033380 in relation_openr (relationName=0x101adae0 "pg_trigger_tgrelid_index", lockmode=0) at heapam.c:518
#17 0x1003ae20 in index_openr (relationName=0x0) at indexam.c:149
#18 0x1009a1a0 in RelationBuildTriggers (relation=0x301897d8) at trigger.c:551
#19 0x1016290c in RelationBuildDesc (buildinfo={infotype = 2, i = {info_id = 270357192, info_name = 0x101d52c8 "pg_shadow"}}, oldrelation=0x301897d8) at relcache.c:1033
#20 0x1016335c in RelationNameGetRelation (relationName=0x101d52c8 "pg_shadow") at relcache.c:1493
#21 0x10033380 in relation_openr (relationName=0x101d52c8 "pg_shadow", lockmode=0) at heapam.c:518
#22 0x1003355c in heap_openr (relationName=0x0, lockmode=6) at heapam.c:595
#23 0x1015ec20 in CatalogCacheInitializeCache (cache=0x102718b8) at catcache.c:216
#24 0x10160194 in SearchCatCache (cache=0x102718b8, v1=270840281, v2=0, v3=0, v4=0) at catcache.c:862
#25 0x10165c50 in SearchSysCache (cacheId=22, key1=270840281, key2=0, key3=0, key4=0) at syscache.c:461
#26 0x1016dba4 in InitializeSessionUserId (username=0x1024b1d9 "davidc") at miscinit.c:450
#27 0x1016ea70 in InitPostgres (dbname=0x10230f40 "regression", username=0x1024b1d9 "davidc") at postinit.c:337
#28 0x10112190 in PostgresMain (argc=4, argv=0x7fffecd8, username=0x1024b1d9 "davidc") at postgres.c:1684
#29 0x100ecb54 in DoBackend (port=0x1024b0a8) at postmaster.c:2243
#30 0x100ec3cc in BackendStartup (port=0x1024b0a8) at postmaster.c:1874
#31 0x100eb224 in ServerLoop () at postmaster.c:977
#32 0x100eaccc in PostmasterMain (argc=4, argv=0x1022ab60) at postmaster.c:771
#33 0x100bdf68 in main (argc=4, argv=0x7ffff814) at main.c:206
#34 0x0fd07f70 in __libc_start_main (argc=4, ubp_av=0x7ffff814, ubp_ev=0x0, auxvec=0x7ffff8a8, rtld_fini=Cannot access memory at address 0x0
) at ../sysdeps/powerpc/elf/libc-start.c:119

> Another interesting line of attack would be to try compiling
> src/backend/storage/lmgr/lwlock.c at different optimization levels,
> to see if the problem goes away with less optimization. We saw a
> problem on AIX (if memory serves) before 7.2 release that turned out
> to be due to overaggressive optimization by the compiler. We thought
> we'd added enough "volatile" keywords to lwlock.c to discourage any
> code rearrangement, but maybe we still need more.

Okay, I will try to figure out how to do what you just said :-) and
meanwhile hope the stack trace above is helpful.

Thanks!
David
On Friday, Sep 20, 2002, at 11:30 US/Eastern, Tom Lane wrote:
> Another interesting line of attack would be to try compiling
> src/backend/storage/lmgr/lwlock.c at different optimization levels,
> to see if the problem goes away with less optimization. We saw a
> problem on AIX (if memory serves) before 7.2 release that turned out
> to be due to overaggressive optimization by the compiler. We thought
> we'd added enough "volatile" keywords to lwlock.c to discourage any
> code rearrangement, but maybe we still need more.

This seems to work:

$ ./configure
$ make
$ cd src/backend/storage/lmgr
$ rm lwlock.o
$ gcc -O0 -g -Wall -Wmissing-prototypes -Wmissing-declarations -I../../../../src/include -c -o lwlock.o lwlock.c
$ cd -
$ make check

All tests pass except 'geometry'.

I also tried the above with -O1, and it still failed on 'make check'.

So, is it safe to proceed this way? If this turns out to be the
solution, is there anything I should be aware of with regard to
stability and performance vs. a normal install?

Thanks,
David
David Christian <davidc@comtechmobile.com> writes:
> On Friday, Sep 20, 2002, at 11:30 US/Eastern, Tom Lane wrote:
>> Another interesting line of attack would be to try compiling
>> src/backend/storage/lmgr/lwlock.c at different optimization levels,

[ and indeed the problem goes away at -O0 ]

> So, is it safe to proceed this way? If this turns out to be the
> solution, is there anything I should be aware of with regard to
> stability and performance vs. a normal install?

This should be stable; whether there's a measurable performance hit from
de-optimizing just that one file is hard to say.

At this point I would say that the problem is that the compiler's
optimizer is rearranging the order of operations inside lwlock.c in
a way that breaks the code for parallel operations. This could be
a compiler bug, or it could be that the compiler is doing something
it's allowed to do under the C specification --- in which case we need
to add some more "volatile"s to fix it.

Could you send me (off-list, since it's likely to be large) the
lwlock.s file produced by
        gcc -O0 -I../../../../src/include -S lwlock.c
as well as the one produced by
        gcc -O1 -I../../../../src/include -S lwlock.c
Groveling through the assembly code should at least tell me what's
being changed ...

regards, tom lane
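To illustrate what adding "volatile"s means in practice, here is a minimal sketch of the idiom. It is not the real lwlock.c code: DemoLock and both functions are hypothetical, heavily simplified stand-ins, and the caller is assumed to already hold whatever mutex protects the struct. Reading and writing the shared state only through a volatile-qualified pointer obliges the compiler to perform each access as written, rather than caching a field in a register or reordering those accesses among themselves.

```c
/* Hypothetical, simplified lock state; NOT the real LWLock struct. */
typedef struct DemoLock
{
    int exclusive;  /* number of exclusive holders (0 or 1) */
    int shared;     /* number of shared holders */
} DemoLock;

/*
 * Take a shared hold unless someone holds the lock exclusively.
 * Returns 1 on success, 0 if the lock is held exclusively.
 */
int
demo_acquire_shared(DemoLock *lockarg)
{
    /* All accesses go through a volatile pointer, never through lockarg. */
    volatile DemoLock *lock = lockarg;

    if (lock->exclusive != 0)
        return 0;

    lock->shared++;
    return 1;
}

/*
 * Drop a shared hold.  Returns 0 on success, or -1 if there was no shared
 * holder, which is the situation the failed assertion in the thread
 * ("lock->shared > 0") guards against.
 */
int
demo_release_shared(DemoLock *lockarg)
{
    volatile DemoLock *lock = lockarg;

    if (lock->shared <= 0)
        return -1;

    lock->shared--;
    return 0;
}
```

Note that volatile restrains only the compiler. It places no requirement on the order in which a processor makes its loads and stores visible to other processors.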
Well, the long and the short of it seems to be that no one before you
ever tried to run Postgres on a multi-CPU PowerPC machine :-(

Some digging around on the net made it clear that we were missing
synchronization instructions that are critical for access to shared
memory in a multi-CPU system. I have applied the attached patch to
CVS tip (7.3beta2-almost). It looks like it will apply cleanly to
7.2.*, so please try it out (with optimization re-enabled) and let
us know what you see!

(I have confirmed that this patch causes no trouble on LinuxPPC and
OS X 10.1, but I do not have a multi-CPU machine to see if it really
solves the problem...)

regards, tom lane

*** src/backend/storage/lmgr/s_lock.c.orig Thu Jun 20 16:29:35 2002
--- src/backend/storage/lmgr/s_lock.c Fri Sep 20 20:11:53 2002
***************
*** 115,120 ****
--- 115,123 ----
  /* used in darwin. */
  /* We key off __APPLE__ here because this function differs from
   * the LinuxPPC implementation only in compiler syntax.
+  *
+  * NOTE: per the Enhanced PowerPC Architecture manual, v1.0 dated 7-May-2002,
+  * an isync is a sufficient synchronization barrier after a lwarx/stwcx loop.
   */
  static void
  tas_dummy()
***************
*** 134,139 ****
--- 137,143 ----
  fail:       li      r3,1 \n\
              blr \n\
  success: \n\
+             isync \n\
              li      r3,0 \n\
              blr \n\
  ");
***************
*** 158,163 ****
--- 162,168 ----
  fail:       li      3,1 \n\
              blr \n\
  success: \n\
+             isync \n\
              li      3,0 \n\
              blr \n\
  ");
*** src/include/storage/s_lock.h.orig Mon Sep 2 09:50:09 2002
--- src/include/storage/s_lock.h Fri Sep 20 20:11:46 2002
***************
*** 217,222 ****
--- 217,237 ----
  #endif /* defined(__mc68000__) && defined(__linux__) */
  
  
+ #if defined(__ppc__) || defined(__powerpc__)
+ /*
+  * We currently use out-of-line assembler for TAS on PowerPC; see s_lock.c.
+  * S_UNLOCK is almost standard but requires a "sync" instruction.
+  */
+ #define S_UNLOCK(lock) \
+ do \
+ {\
+     __asm__ __volatile__ (" sync \n"); \
+     *((volatile slock_t *) (lock)) = 0; \
+ } while (0)
+ 
+ #endif /* defined(__ppc__) || defined(__powerpc__) */
+ 
+ 
  #if defined(NEED_VAX_TAS_ASM)
  /*
   * VAXen -- even multiprocessor ones
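The two barriers in the patch line up with what later C standards call acquire and release ordering. The isync after a successful lwarx/stwcx loop keeps the critical section's memory accesses from being performed before the lock is actually owned, and the sync ahead of the unlocking store makes the critical section's writes visible before the lock appears free. A minimal sketch of the same idea in portable C11 follows. It assumes a compiler that provides <stdatomic.h> (which did not exist when this thread was written), the names demo_slock_t, demo_tas and demo_unlock are hypothetical, and this is not the code the patch installs. The compiler picks the actual barrier instructions, which need not be isync and sync.

```c
#include <stdatomic.h>

/* Hypothetical type; not PostgreSQL's slock_t. */
typedef atomic_flag demo_slock_t;

/*
 * Test-and-set with acquire ordering.  Returns 0 if the lock was free and
 * now belongs to the caller, 1 if it was already held.  That matches the
 * convention of the assembly in the patch (0 at "success", 1 at "fail").
 */
int
demo_tas(demo_slock_t *lock)
{
    /* atomic_flag_test_and_set_explicit returns the flag's previous value */
    return atomic_flag_test_and_set_explicit(lock, memory_order_acquire) ? 1 : 0;
}

/*
 * Unlock with release ordering: every write made while holding the lock
 * becomes visible to other CPUs no later than the cleared flag does.
 */
void
demo_unlock(demo_slock_t *lock)
{
    atomic_flag_clear_explicit(lock, memory_order_release);
}
```

A caller would declare the lock as "demo_slock_t lock = ATOMIC_FLAG_INIT;", spin while demo_tas(&lock) returns 1, do its work on the shared data, and then call demo_unlock(&lock).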
I think you've fixed it. With your patch, and a simple

$ ./configure
$ make
$ make check
# make install

the check works and all tests (except geometry on floating point stuff)
pass; and after installing, I can really hammer the server and it
doesn't hang.

Looks like users on Yellow Dog Linux multi-CPU PowerPC platforms won't
have this problem anymore ... that is, when the next person besides me
decides to try it. :-)

This whole exercise was well worth my time, and I get the added bonus of
not having to switch machines. I know you spent a lot of time on it and
I greatly appreciate your care and responsiveness.

Many thanks - feel free to ask me to check anything else on this
platform you would like to see.

David

On Friday, Sep 20, 2002, at 20:40 US/Eastern, Tom Lane wrote:
> Well, the long and the short of it seems to be that no one before you
> ever tried to run Postgres on a multi-CPU PowerPC machine :-(
>
> Some digging around on the net made it clear that we were missing
> synchronization instructions that are critical for access to shared
> memory in a multi-CPU system. I have applied the attached patch to
> CVS tip (7.3beta2-almost). It looks like it will apply cleanly to
> 7.2.*, so please try it out (with optimization re-enabled) and let
> us know what you see!
[snip]