Thread: isolation test deadlocking on buildfarm member coypu

isolation test deadlocking on buildfarm member coypu

From: Rémi Zara
Date: Sat, 16 Jul 2011 17:50:45 +0200
Hi,

Isolation tests seem to deadlock on buildfarm member coypu (NetBSD/powerpc 5.1).

Here are the processes for one deadlock (several days):

pgbuildfarm  2405  0.0  1.2 26948  7916 ?     Is   Wed02AM    0:00.25 postgres: pgbuildfarm isolationtest [local] INSERT waiting
pgbuildfarm  6559  0.0  0.4 10288  2684 ?     I    Wed02AM    0:00.05 ./pg_isolation_regress --psqldir=/home/pgbuildfarm/workdir/HEAD/inst/bin --inputdir=. --schedule=./isolation_schedule (pg_isolation_reg)
pgbuildfarm  8296  0.0  1.8 25924 11880 ?     Ss   Wed02AM    0:59.79 postgres: writer process
pgbuildfarm 10630  0.0  0.4 16212  2916 ?     Ss   Wed02AM    1:43.16 postgres: stats collector process
pgbuildfarm 11814  0.0  0.7  9084  4408 ?     I    Wed01AM    0:00.39 /usr/pkg/bin/perl ./run_branches.pl --run-one
pgbuildfarm 14785  0.0  0.2  3344  1276 ?     I    Wed02AM    0:00.03 gmake NO_LOCALE=1 installcheck
pgbuildfarm 15353  0.0  0.8 26948  4960 ?     Ss   Wed02AM    1:10.55 postgres: autovacuum launcher process
pgbuildfarm 16619  0.0  1.8 15992 11512 ?     I    Wed01AM    0:04.90 /usr/pkg/bin/perl ./run_build.pl --config build-farm.conf HEAD
pgbuildfarm 17532  0.0  0.5 25924  3052 ?     Ss   Wed02AM    0:34.49 postgres: wal writer process
pgbuildfarm 18073  0.0  1.0 25924  6844 ?     I    Wed02AM    0:19.83 /home/pgbuildfarm/workdir/HEAD/inst/bin/postgres -D data-C
pgbuildfarm 21310  0.0  1.2 26948  7948 ?     Is   Wed02AM    0:00.06 postgres: pgbuildfarm isolationtest [local] idle in transaction
pgbuildfarm 24137  0.0  0.2  3456  1064 ?     Is   Wed01AM    0:00.01 /bin/sh -c cd /home/pgbuildfarm/client-code && ./run_branches.pl --run-one
pgbuildfarm 26932  0.0  0.2  3456  1072 ?     I    Wed02AM    0:00.01 sh -c cd pgsql.16619/src/test/isolation && gmake NO_LOCALE=1 installcheck 2>&1
pgbuildfarm 28430  0.0  1.3 26948  8816 ?     Ss   Wed02AM  408:43.91 postgres: pgbuildfarm isolationtest [local] idle
pgbuildfarm 28809  0.0  0.4 10476  2844 ?     R    Wed02AM   27:48.24 ./isolationtester dbname=isolationtest


I killed one of the postgres processes, which led to a failure of the buildfarm perl script (probably due to the 2GB+
logfile) and left the postgres cluster behind (I had to clean it up manually).
I've kept the source and install directories, if useful for investigation.


Here are the processes for a second deadlock (several hours):

pgbuildfarm  1423  0.0  0.4 16212  2888 ?     Ss    2:11PM    0:04.07 postgres: stats collector process
pgbuildfarm  1604  0.0  1.2 26948  7724 ?     Is    2:13PM    0:00.05 postgres: pgbuildfarm isolationtest [local] idle in transaction
pgbuildfarm  2380  0.0  0.4 10288  2624 ?     I     2:12PM    0:00.05 ./pg_isolation_regress --psqldir=/home/pgbuildfarm/workdir/HEAD/inst/bin --inputdir=. --schedule=./isolation_schedule (pg_isolation_reg)
pgbuildfarm  3156  0.0  0.8 26948  4920 ?     Ss    2:11PM    0:02.71 postgres: autovacuum launcher process
pgbuildfarm  3253  0.0  0.5 25924  3044 ?     Ss    2:11PM    0:01.65 postgres: wal writer process
pgbuildfarm  9889  0.0  1.8 25924 11856 ?     Ss    2:11PM    0:04.87 postgres: writer process
pgbuildfarm 11359  0.0  1.4 26948  9032 ?     Ss    2:13PM   15:16.63 postgres: pgbuildfarm isolationtest [local] idle
pgbuildfarm 13738  0.0  1.2 26948  7860 ?     Ss    2:13PM    0:00.06 postgres: pgbuildfarm isolationtest [local] UPDATE waiting
pgbuildfarm 14031  0.0  1.0 25924  6800 ?     S     2:11PM    0:01.82 /home/pgbuildfarm/workdir/HEAD/inst/bin/postgres -D data-C
pgbuildfarm 16555  0.0  0.4 10476  2764 ?     S     2:13PM    1:01.77 ./isolationtester dbname=isolationtest
pgbuildfarm 18038  0.0  0.2  3456  1008 ?     I     2:11PM    0:00.01 sh -c cd pgsql.20707/src/test/isolation && gmake NO_LOCALE=1 installcheck 2>&1
pgbuildfarm 19057  0.0  0.2  3456  1000 ?     Is    1:30PM    0:00.01 /bin/sh -c cd /home/pgbuildfarm/client-code && ./run_branches.pl --run-one
pgbuildfarm 20707  0.0  1.8 15992 11460 ?     I     1:30PM    0:04.92 /usr/pkg/bin/perl ./run_build.pl --config build-farm.conf HEAD
pgbuildfarm 21749  0.0  0.2  3344  1248 ?     I     2:11PM    0:00.04 gmake NO_LOCALE=1 installcheck
pgbuildfarm 23054  0.0  0.7  9084  4352 ?     I     1:30PM    0:00.38 /usr/pkg/bin/perl ./run_branches.pl --run-one
pgbuildfarm  2125  0.0  0.0  2652     0 ttyp0 R+    5:09PM    0:00.00 grep pgbuildf (bash)
pgbuildfarm  9930  0.0  0.3  2652  2072 ttyp0 S     5:03PM    0:00.09 bash
pgbuildfarm 29194  0.0  0.1  3552   872 ttyp0 O+    5:09PM    0:00.01 ps -aux
pgbuildfarm@cube:workdir$ date
Sat Jul 16 17:10:13 CEST 2011

The log fills with:

[4e21807d.2c5f:1060074] LOG:  execute isolationtester_waiting: SELECT 1 WHERE pg_stat_get_backend_waiting($1)
[4e21807d.2c5f:1060075] DETAIL:  parameters: $1 = '5'
[4e21807d.2c5f:1060076] LOG:  execute isolationtester_waiting: SELECT 1 WHERE pg_stat_get_backend_waiting($1)
[4e21807d.2c5f:1060077] DETAIL:  parameters: $1 = '5'
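
Those repeating statements are isolationtester polling whether the blocked session is still waiting on a lock. For
reference, a rough hand-run equivalent using the same pg_stat_get_backend_* statistics functions (not the harness's
own prepared query) would be:

-- List each backend slot, its pid, whether it is currently blocked waiting on a
-- lock, and its current query.  Note that pg_stat_get_backend_waiting() takes a
-- backend slot number from pg_stat_get_backend_idset(), not an OS pid.
SELECT pg_stat_get_backend_pid(backendid)      AS pid,
       pg_stat_get_backend_waiting(backendid)  AS waiting,
       pg_stat_get_backend_activity(backendid) AS current_query
FROM pg_stat_get_backend_idset() AS backendid;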

I killed process 1604 (postgres: pgbuildfarm isolationtest [local] idle in transaction). Again, the buildfarm perl
script kicked into action. See http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=coypu&dt=2011-07-16%2011%3A30%3A02
for the logs.

Regards,

Rémi Zara


Re: isolation test deadlocking on buildfarm member coypu

From: Noah Misch
Date: Sat, 16 Jul 2011 13:25:03 -0400
On Sat, Jul 16, 2011 at 05:50:45PM +0200, Rémi Zara wrote:
> Isolation tests seem to deadlock on buildfarm member coypu (NetBSD/powerpc 5.1).

Thanks for the report and detailed analysis.  I believe the patch here will fix
the problem:
http://archives.postgresql.org/message-id/20110716171121.GB2047@tornado.leadboat.com

If you're up for testing it locally, I'd be interested to hear how it works out.

Thanks,
nm

-- 
Noah Misch                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: isolation test deadlocking on buildfarm member coypu

From: Alvaro Herrera
Date:
Excerpts from Noah Misch's message of Sat Jul 16 13:25:03 -0400 2011:
> On Sat, Jul 16, 2011 at 05:50:45PM +0200, Rémi Zara wrote:
> > Isolation tests seem to deadlock on buildfarm member coypu (NetBSD/powerpc 5.1).
> 
> Thanks for the report and detailed analysis.  I believe the patch here will fix
> the problem:
> http://archives.postgresql.org/message-id/20110716171121.GB2047@tornado.leadboat.com
> 
> If you're up for testing it locally, I'd be interested to hear how it works out.

Rémi,

I have committed Noah's patch, so please unblock coypu and locust so we
can see whether they are OK now.

Thanks,

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: isolation test deadlocking on buildfarm member coypu

From: Alvaro Herrera
Date: Tue, 26 Jul 2011 17:04:28 -0400
After committing Noah's patch to fix the isolation tests, there have
been two more failures in Rémi's machines pika and coypu:

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=pika&dt=2011-07-24%2006%3A45%3A45
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=coypu&dt=2011-07-23%2021%3A50%3A54

In both cases, the failure is identical in fk-deadlock2:

================== pgsql.20950/src/test/isolation/regression.diffs ===================
*** /home/pgbuildfarm/workdir/HEAD/pgsql.20950/src/test/isolation/expected/fk-deadlock2.out    Sun Jul 24 08:46:44 2011
--- /home/pgbuildfarm/workdir/HEAD/pgsql.20950/src/test/isolation/results/fk-deadlock2.out    Sun Jul 24 15:11:42 2011
***************
*** 22,29 ****
  step s2u1:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
  step s1u2:  UPDATE B SET Col2 = 1 WHERE BID = 2;  <waiting ...>
  step s2u2:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
- step s1u2: <... completed>
  ERROR:  deadlock detected
  step s1c:  COMMIT; 
  step s2c:  COMMIT; 
  
--- 22,29 ----
  step s2u1:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
  step s1u2:  UPDATE B SET Col2 = 1 WHERE BID = 2;  <waiting ...>
  step s2u2:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
  ERROR:  deadlock detected
+ step s1u2: <... completed>
  step s1c:  COMMIT; 
  step s2c:  COMMIT; 



I think the only explanation needed here is that one process reports its status
before the other one does.  Adding another variant of the expected file would be
enough to fix this, but I don't much want to do that: we already have three of
them, and we would need one new variant per existing expected file, so we'd end
up with six expected files, which would be a pain to work with.
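
To see why the error can attach to either step, here is a minimal sketch of a
generic two-session lock-ordering deadlock (not the FK arrangement fk-deadlock2
actually exercises; table t and its contents are made up for illustration):

-- Setup, in either session:
CREATE TABLE t (id int PRIMARY KEY, v int);
INSERT INTO t VALUES (1, 0), (2, 0);

-- step 1, session A:
BEGIN; UPDATE t SET v = 1 WHERE id = 1;
-- step 2, session B:
BEGIN; UPDATE t SET v = 1 WHERE id = 2;
-- step 3, session A (blocks, waiting for B's row lock):
UPDATE t SET v = 1 WHERE id = 2;
-- step 4, session B (closes the cycle; after deadlock_timeout, one of the two
-- blocked backends raises "ERROR:  deadlock detected" and the other session's
-- UPDATE then completes):
UPDATE t SET v = 1 WHERE id = 1;

Whichever blocked backend's deadlock_timeout fires first runs the deadlock check
and aborts itself with the error, and the other session's UPDATE then completes;
how those two events interleave in the client output is exactly the ordering the
diff above shows varying.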

Not quite sure what to do with this.

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: isolation test deadlocking on buildfarm member coypu

From: Noah Misch
Date:
On Tue, Jul 26, 2011 at 05:04:28PM -0400, Alvaro Herrera wrote:
> *** /home/pgbuildfarm/workdir/HEAD/pgsql.20950/src/test/isolation/expected/fk-deadlock2.out    Sun Jul 24 08:46:44
2011
> --- /home/pgbuildfarm/workdir/HEAD/pgsql.20950/src/test/isolation/results/fk-deadlock2.out    Sun Jul 24 15:11:42
2011
> ***************
> *** 22,29 ****
>   step s2u1:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
>   step s1u2:  UPDATE B SET Col2 = 1 WHERE BID = 2;  <waiting ...>
>   step s2u2:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
> - step s1u2: <... completed>
>   ERROR:  deadlock detected
>   step s1c:  COMMIT; 
>   step s2c:  COMMIT; 
>   
> --- 22,29 ----
>   step s2u1:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
>   step s1u2:  UPDATE B SET Col2 = 1 WHERE BID = 2;  <waiting ...>
>   step s2u2:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
>   ERROR:  deadlock detected
> + step s1u2: <... completed>
>   step s1c:  COMMIT; 
>   step s2c:  COMMIT; 
> 
> 
> 
> I think the only explanation necessary for this is that one process
> reports its status before the other one.  I think it would be enough to
> add another variant of the expected file to fix this problem, but I
> don't quite want to do that because we already have three of them, and I
> think we would need to add one per existing expected, so we'd end up
> with 6 expected files which would be a pain to work with.

To really cover the problem in this way, we would need 16*3 variations covering
every permutation of the deadlocks detected when s1 runs the first command.  I
see these other options:

1. Keep increasing the s1 deadlock_timeout until we stop getting these.  This is
simple, but it proportionally slows the test suite for everyone.  No value will
ever be guaranteed sufficient.  (The per-session deadlock_timeout setting this
relies on is sketched after this list.)

2. Post-process the output to ascribe the deadlock detections in a standard way
before comparing with expected output.  This would also let us drop
deadlock_timeout arbitrarily low, giving a rapid test run.

3. Implement actual deadlock priorities, per discussion at
http://archives.postgresql.org/message-id/AANLkTimAqFzKV4Sc1DScruFft_Be78Y-oWPeuURCDnjR@mail.gmail.com
This is much more work, but it would let us drop deadlock_timeout arbitrarily
low and still get consistent results from the start.
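
As referenced under option 1: it relies on deadlock_timeout being settable per
session, so one backend's deadlock check can be arranged to fire long before the
other's.  A sketch of just that mechanism, with made-up values (not what the
spec currently uses):

-- In the session whose backend should normally detect and report the deadlock:
SET deadlock_timeout = '20ms';

-- In the other session, so its check only runs if the first backend is badly
-- delayed:
SET deadlock_timeout = '100s';

-- In either session, to confirm the per-session value:
SHOW deadlock_timeout;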

-- 
Noah Misch                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services