Thread: isolation test deadlocking on buildfarm member coypu
Hi,

Isolation tests seem to deadlock on buildfarm member coypu (NetBSD/powerpc 5.1).

Here are the processes for one deadlock (several days):

pgbuildfarm  2405 0.0 1.2 26948  7916 ?  Is  Wed02AM   0:00.25 postgres: pgbuildfarm isolationtest [local] INSERT waiting
pgbuildfarm  6559 0.0 0.4 10288  2684 ?  I   Wed02AM   0:00.05 ./pg_isolation_regress --psqldir=/home/pgbuildfarm/workdir/HEAD/inst/bin --inputdir=. --schedule=./isolation_schedule (pg_isolation_reg)
pgbuildfarm  8296 0.0 1.8 25924 11880 ?  Ss  Wed02AM   0:59.79 postgres: writer process
pgbuildfarm 10630 0.0 0.4 16212  2916 ?  Ss  Wed02AM   1:43.16 postgres: stats collector process
pgbuildfarm 11814 0.0 0.7  9084  4408 ?  I   Wed01AM   0:00.39 /usr/pkg/bin/perl ./run_branches.pl --run-one
pgbuildfarm 14785 0.0 0.2  3344  1276 ?  I   Wed02AM   0:00.03 gmake NO_LOCALE=1 installcheck
pgbuildfarm 15353 0.0 0.8 26948  4960 ?  Ss  Wed02AM   1:10.55 postgres: autovacuum launcher process
pgbuildfarm 16619 0.0 1.8 15992 11512 ?  I   Wed01AM   0:04.90 /usr/pkg/bin/perl ./run_build.pl --config build-farm.conf HEAD
pgbuildfarm 17532 0.0 0.5 25924  3052 ?  Ss  Wed02AM   0:34.49 postgres: wal writer process
pgbuildfarm 18073 0.0 1.0 25924  6844 ?  I   Wed02AM   0:19.83 /home/pgbuildfarm/workdir/HEAD/inst/bin/postgres -D data-C
pgbuildfarm 21310 0.0 1.2 26948  7948 ?  Is  Wed02AM   0:00.06 postgres: pgbuildfarm isolationtest [local] idle in transaction
pgbuildfarm 24137 0.0 0.2  3456  1064 ?  Is  Wed01AM   0:00.01 /bin/sh -c cd /home/pgbuildfarm/client-code && ./run_branches.pl --run-one
pgbuildfarm 26932 0.0 0.2  3456  1072 ?  I   Wed02AM   0:00.01 sh -c cd pgsql.16619/src/test/isolation && gmake NO_LOCALE=1 installcheck 2>&1
pgbuildfarm 28430 0.0 1.3 26948  8816 ?  Ss  Wed02AM 408:43.91 postgres: pgbuildfarm isolationtest [local] idle
pgbuildfarm 28809 0.0 0.4 10476  2844 ?  R   Wed02AM  27:48.24 ./isolationtester dbname=isolationtest

I killed one of the postgres processes, which led to a failure of the buildfarm perl script (probably due to the 2GB+ logfile) and to the postgres cluster persisting (I had to clean it up manually). I've kept the source and install directories, if useful for investigation.

Here are the processes for a second lock (several hours):

pgbuildfarm  1423 0.0 0.4 16212  2888 ?      Ss  2:11PM  0:04.07 postgres: stats collector process
pgbuildfarm  1604 0.0 1.2 26948  7724 ?      Is  2:13PM  0:00.05 postgres: pgbuildfarm isolationtest [local] idle in transaction
pgbuildfarm  2380 0.0 0.4 10288  2624 ?      I   2:12PM  0:00.05 ./pg_isolation_regress --psqldir=/home/pgbuildfarm/workdir/HEAD/inst/bin --inputdir=. --schedule=./isolation_schedule (pg_isolation_reg)
pgbuildfarm  3156 0.0 0.8 26948  4920 ?      Ss  2:11PM  0:02.71 postgres: autovacuum launcher process
pgbuildfarm  3253 0.0 0.5 25924  3044 ?      Ss  2:11PM  0:01.65 postgres: wal writer process
pgbuildfarm  9889 0.0 1.8 25924 11856 ?      Ss  2:11PM  0:04.87 postgres: writer process
pgbuildfarm 11359 0.0 1.4 26948  9032 ?      Ss  2:13PM 15:16.63 postgres: pgbuildfarm isolationtest [local] idle
pgbuildfarm 13738 0.0 1.2 26948  7860 ?      Ss  2:13PM  0:00.06 postgres: pgbuildfarm isolationtest [local] UPDATE waiting
pgbuildfarm 14031 0.0 1.0 25924  6800 ?      S   2:11PM  0:01.82 /home/pgbuildfarm/workdir/HEAD/inst/bin/postgres -D data-C
pgbuildfarm 16555 0.0 0.4 10476  2764 ?      S   2:13PM  1:01.77 ./isolationtester dbname=isolationtest
pgbuildfarm 18038 0.0 0.2  3456  1008 ?      I   2:11PM  0:00.01 sh -c cd pgsql.20707/src/test/isolation && gmake NO_LOCALE=1 installcheck 2>&1
pgbuildfarm 19057 0.0 0.2  3456  1000 ?      Is  1:30PM  0:00.01 /bin/sh -c cd /home/pgbuildfarm/client-code && ./run_branches.pl --run-one
pgbuildfarm 20707 0.0 1.8 15992 11460 ?      I   1:30PM  0:04.92 /usr/pkg/bin/perl ./run_build.pl --config build-farm.conf HEAD
pgbuildfarm 21749 0.0 0.2  3344  1248 ?      I   2:11PM  0:00.04 gmake NO_LOCALE=1 installcheck
pgbuildfarm 23054 0.0 0.7  9084  4352 ?      I   1:30PM  0:00.38 /usr/pkg/bin/perl ./run_branches.pl --run-one
pgbuildfarm  2125 0.0 0.0  2652     0 ttyp0  R+  5:09PM  0:00.00 grep pgbuildf (bash)
pgbuildfarm  9930 0.0 0.3  2652  2072 ttyp0  S   5:03PM  0:00.09 bash
pgbuildfarm 29194 0.0 0.1  3552   872 ttyp0  O+  5:09PM  0:00.01 ps -aux

pgbuildfarm@cube:workdir$ date
Sat Jul 16 17:10:13 CEST 2011

The log fills with:

[4e21807d.2c5f:1060074] LOG:  execute isolationtester_waiting: SELECT 1 WHERE pg_stat_get_backend_waiting($1)
[4e21807d.2c5f:1060075] DETAIL:  parameters: $1 = '5'
[4e21807d.2c5f:1060076] LOG:  execute isolationtester_waiting: SELECT 1 WHERE pg_stat_get_backend_waiting($1)
[4e21807d.2c5f:1060077] DETAIL:  parameters: $1 = '5'

I killed process 1604 (postgres: pgbuildfarm isolationtest [local] idle in transaction). Again, the buildfarm perl script kicked into action.

See http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=coypu&dt=2011-07-16%2011%3A30%3A02 for the logs.

Regards,

Rémi Zara
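For context on the repeated isolationtester_waiting log entries: isolationtester detects a blocked session by busy-waiting, re-executing that prepared query until the target backend reports it is waiting on a lock. The following is a rough sketch of such a polling loop, in Python for brevity, with the query replaced by an injected callable; the real implementation lives in C (src/test/isolation/isolationtester.c), and the timeout parameter here is an illustrative addition, not something the real tester necessarily has.

```python
import time

def wait_until_blocked(is_waiting, backend_id, timeout=60.0, interval=0.01):
    """Poll until the given backend reports that it is waiting on a lock.

    is_waiting: a callable taking a backend id and returning bool. In the
    real isolationtester this role is played by re-executing the prepared
    query "SELECT 1 WHERE pg_stat_get_backend_waiting($1)"; here it is
    injected so the sketch is self-contained.
    Returns True once the backend is seen waiting, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_waiting(backend_id):
            return True
        time.sleep(interval)  # each iteration produces one log entry above
    return False
```

Each poll iteration corresponds to one LOG/DETAIL pair in the server log, which is why a hung test leaves a multi-gigabyte logfile behind.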
On Sat, Jul 16, 2011 at 05:50:45PM +0200, Rémi Zara wrote:
> Isolation tests seem to deadlock on buildfarm member coypu (NetBSD/powerpc 5.1).

Thanks for the report and detailed analysis. I believe the patch here will fix the problem:
http://archives.postgresql.org/message-id/20110716171121.GB2047@tornado.leadboat.com

If you're up for testing it locally, I'd be interested to hear how it works out.

Thanks,
nm

--
Noah Misch                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Excerpts from Noah Misch's message of sáb jul 16 13:25:03 -0400 2011:
> On Sat, Jul 16, 2011 at 05:50:45PM +0200, Rémi Zara wrote:
> > Isolation tests seem to deadlock on buildfarm member coypu (NetBSD/powerpc 5.1).
>
> Thanks for the report and detailed analysis. I believe the patch here will fix
> the problem:
> http://archives.postgresql.org/message-id/20110716171121.GB2047@tornado.leadboat.com
>
> If you're up for testing it locally, I'd be interested to hear how it works out.

Rémi, I have committed Noah's patch, so please unblock coypu and locust so we can see whether they are OK now.

Thanks,

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
After committing Noah's patch to fix the isolation tests, there have been two more failures on Rémi's machines pika and coypu:

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=pika&dt=2011-07-24%2006%3A45%3A45
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=coypu&dt=2011-07-23%2021%3A50%3A54

In both cases, the failure is identical, in fk-deadlock2:

================== pgsql.20950/src/test/isolation/regression.diffs ===================
*** /home/pgbuildfarm/workdir/HEAD/pgsql.20950/src/test/isolation/expected/fk-deadlock2.out	Sun Jul 24 08:46:44 2011
--- /home/pgbuildfarm/workdir/HEAD/pgsql.20950/src/test/isolation/results/fk-deadlock2.out	Sun Jul 24 15:11:42 2011
***************
*** 22,29 ****
  step s2u1: UPDATE B SET Col2 = 1 WHERE BID = 2;
  step s1u2: UPDATE B SET Col2 = 1 WHERE BID = 2; <waiting ...>
  step s2u2: UPDATE B SET Col2 = 1 WHERE BID = 2;
- step s1u2: <... completed>
  ERROR:  deadlock detected
  step s1c: COMMIT;
  step s2c: COMMIT;
--- 22,29 ----
  step s2u1: UPDATE B SET Col2 = 1 WHERE BID = 2;
  step s1u2: UPDATE B SET Col2 = 1 WHERE BID = 2; <waiting ...>
  step s2u2: UPDATE B SET Col2 = 1 WHERE BID = 2;
  ERROR:  deadlock detected
+ step s1u2: <... completed>
  step s1c: COMMIT;
  step s2c: COMMIT;

I think the only explanation necessary for this is that one process reports its status before the other one. It would be enough to add another variant of the expected file to fix this problem, but I don't quite want to do that, because we already have three of them, and I think we would need to add one per existing expected file; we'd end up with six expected files, which would be a pain to work with.

Not quite sure what to do with this.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Tue, Jul 26, 2011 at 05:04:28PM -0400, Alvaro Herrera wrote:
> *** /home/pgbuildfarm/workdir/HEAD/pgsql.20950/src/test/isolation/expected/fk-deadlock2.out	Sun Jul 24 08:46:44 2011
> --- /home/pgbuildfarm/workdir/HEAD/pgsql.20950/src/test/isolation/results/fk-deadlock2.out	Sun Jul 24 15:11:42 2011
> ***************
> *** 22,29 ****
>   step s2u1: UPDATE B SET Col2 = 1 WHERE BID = 2;
>   step s1u2: UPDATE B SET Col2 = 1 WHERE BID = 2; <waiting ...>
>   step s2u2: UPDATE B SET Col2 = 1 WHERE BID = 2;
> - step s1u2: <... completed>
>   ERROR:  deadlock detected
>   step s1c: COMMIT;
>   step s2c: COMMIT;
> --- 22,29 ----
>   step s2u1: UPDATE B SET Col2 = 1 WHERE BID = 2;
>   step s1u2: UPDATE B SET Col2 = 1 WHERE BID = 2; <waiting ...>
>   step s2u2: UPDATE B SET Col2 = 1 WHERE BID = 2;
>   ERROR:  deadlock detected
> + step s1u2: <... completed>
>   step s1c: COMMIT;
>   step s2c: COMMIT;
>
> I think the only explanation necessary for this is that one process
> reports its status before the other one. I think it would be enough to
> add another variant of the expected file to fix this problem, but I
> don't quite want to do that because we already have three of them, and I
> think we would need to add one per existing expected, so we'd end up
> with 6 expected files which would be a pain to work with.

To really cover the problem in this way, we would need 16*3 variations, covering every permutation of the deadlocks detected when s1 runs the first command. I see these other options:

1. Keep increasing the s1 deadlock_timeout until we stop getting these. This is simple, but it proportionally slows the test suite for everyone, and no value will ever be guaranteed sufficient.

2. Post-process the output to ascribe the deadlock detections in a standard way before comparing with expected output. This would also let us drop deadlock_timeout arbitrarily low, giving a rapid test run.

3. Implement actual deadlock priorities, per discussion at
http://archives.postgresql.org/message-id/AANLkTimAqFzKV4Sc1DScruFft_Be78Y-oWPeuURCDnjR@mail.gmail.com
This is much more work, but it would let us drop deadlock_timeout arbitrarily low and still get consistent results from the start.

--
Noah Misch                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
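As an illustration of option 2 above, a post-processing pass could canonicalize the two observed orderings before the diff against the expected file. The sketch below (Python, a hypothetical helper rather than anything in the actual test harness) normalizes output so that a step's "<... completed>" notice always precedes an adjacent deadlock error, collapsing both race outcomes seen in the fk-deadlock2 diff to one form:

```python
def canonicalize(lines):
    """Normalize isolationtester output so that a deadlock error and the
    completion notice of the step that was waiting appear in a fixed
    order, regardless of which backend happened to report first.

    The line format follows the regression.diffs quoted in this thread.
    """
    out = []
    i = 0
    while i < len(lines):
        line = lines[i]
        nxt = lines[i + 1] if i + 1 < len(lines) else None
        # If the error precedes the "<... completed>" notice, swap the
        # pair so both observed orderings yield identical output.
        if (line.startswith("ERROR:") and nxt is not None
                and nxt.endswith("<... completed>")):
            out.extend([nxt, line])
            i += 2
        else:
            out.append(line)
            i += 1
    return out
```

Run over both results files before comparison, the two flaky orderings would no longer produce a diff; only a genuinely different deadlock outcome would.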