BUG #12718: Apparent LWlock Problem Perhaps With Page Cleanup - Mailing list pgsql-bugs

From charles.harwood@zuerchertech.com
Subject BUG #12718: Apparent LWlock Problem Perhaps With Page Cleanup
Date
Msg-id 20150130221445.2583.8173@wrigleys.postgresql.org
Responses Re: BUG #12718: Apparent LWlock Problem Perhaps With Page Cleanup  (Peter Geoghegan <pg@heroku.com>)
List pgsql-bugs
The following bug has been logged on the website:

Bug reference:      12718
Logged by:          Charles R. Harwood
Email address:      charles.harwood@zuerchertech.com
PostgreSQL version: 9.3.5
Operating system:   Ubuntu Linux 12.04, kernel 3.13.0
Description:

Many postgres worker processes stopped responding.  They remained in state
'active' (not 'waiting') for hours.  Every hung query was against the same
table that an autovacuum process was also stuck processing, and that
autovacuum was the oldest active transaction.

Eventually I had to restart the cluster.

The theory is that the VACUUM somehow got the table into a state such that
certain queries run against it would hang trying to wrestle an LWLock away
from it.
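
For context, the 'waiting' column in pg_stat_activity (like pg_locks) only
reflects heavyweight locks.  A check along the lines below (not run during
the incident; just a sketch of how ordinary lock contention could be ruled
out) should come back empty in a situation like this, which points at an
LWLock wait rather than a regular lock:

-- Hypothetical follow-up check: any heavyweight lock waits would show up
-- here; an empty result with long-'active' backends is consistent with
-- waiting on an LWLock, which pg_locks does not track.
SELECT l.pid, l.locktype, l.relation::regclass, l.mode, a.query_start
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE NOT l.granted;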



This shows a subset of pg_stat_activity many hours into the problem:
# SELECT NOW() - xact_start AS xact_age, query_start, pid, waiting, state,
         SUBSTRING(query, 1, 30)
  FROM pg_stat_activity
  WHERE state <> 'idle'
  ORDER BY NOW() - xact_start DESC;

    xact_age     |          query_start          |  pid  | waiting | state  |           substring
-----------------+-------------------------------+-------+---------+--------+--------------------------------
 20:04:20.469875 | 2015-01-29 17:39:21.520094-06 | 24718 | f       | active | autovacuum: VACUUM public.addr
 19:59:58.066609 | 2015-01-29 17:43:43.92336-06  | 24687 | f       | active | SELECT s0.id AS id, f0, f1, f1
 19:59:40.054033 | 2015-01-29 17:44:01.935936-06 | 21386 | f       | active | SELECT s0.id AS id, f0, f1, f1
 19:59:24.117727 | 2015-01-29 17:44:17.872242-06 | 20370 | f       | active | SELECT s0.id AS id, f0, f1, f1
 19:32:51.946155 | 2015-01-29 18:10:50.064425-06 |  2900 | f       | active | UPDATE addresses...
 19:11:38.715295 | 2015-01-29 18:32:03.274674-06 | 13545 | f       | active | SELECT s0.id AS id, f0, f1, f1
 18:53:00.421641 | 2015-01-29 18:50:41.568328-06 |  2367 | f       | active | SELECT '', '', street, city, s
 12:53:32.658331 | 2015-01-30 00:50:09.331638-06 | 22220 | f       | active | SELECT '', '', street, city, s
 11:46:31.322061 | 2015-01-30 01:57:10.667908-06 | 17054 | f       | active | SELECT addressesid FROM ((SEL
  11:36:33.11271 | 2015-01-30 02:07:08.877259-06 | 22378 | f       | active | SELECT addressesid FROM ((SEL
 07:43:06.642865 | 2015-01-30 06:00:37.139222-06 | 26815 | f       | active | SELECT s0.id AS id, f0, f1, f1
 04:03:57.508697 | 2015-01-30 09:39:44.481272-06 | 19351 | f       | active | SELECT addressesid FROM ((SEL
 03:37:39.971883 | 2015-01-30 10:06:02.018086-06 | 10422 | f       | active | SELECT addressesid FROM ((SEL
 02:39:14.576335 | 2015-01-30 11:04:27.413634-06 | 21364 | f       | active | SELECT addressesid FROM ((SEL
 02:24:00.125199 | 2015-01-30 11:19:41.86477-06  | 14600 | f       | active | SELECT addressesid FROM ((SEL
 02:03:32.915133 | 2015-01-30 11:40:09.074836-06 | 22547 | f       | active | SELECT addressesid FROM ((SEL
 01:53:49.639909 | 2015-01-30 11:49:52.35006-06  | 31567 | f       | active | SELECT s0.id AS id, f0, f1, f1
 01:53:44.700417 | 2015-01-30 11:49:57.289552-06 | 31571 | f       | active | SELECT s0.id AS id, f0, f1, f1
 01:53:39.645157 | 2015-01-30 11:50:02.344812-06 | 31623 | f       | active | SELECT s0.id AS id, f0, f1, f1
 01:53:20.892623 | 2015-01-30 11:50:21.097346-06 | 31887 | f       | active | SELECT s0.id AS id, f0, f1, f1
 01:08:30.519618 | 2015-01-30 12:35:11.470351-06 | 21167 | f       | active | SELECT namesid, name, type, in
 01:06:58.586233 | 2015-01-30 12:36:43.403736-06 | 24513 | f       | active | SELECT data.f5 AS y_id, data.f
 01:06:18.159009 | 2015-01-30 12:37:23.83096-06  | 21047 | f       | active | SELECT namesid, name, type, in


Here is what those processes were up to.  I have full backtraces which I
will attach when this gets to a bugtracker.

# ps -o pid,s,cmd,wchan=WIDE-WCHAN-COLUMN -p \
    2367,2900,10422,13545,14600,17054,19351,20370,21047,21167,21364,21386,22220,22378,22547,24513,24687,24718,26815,31567,31571,31623,31887
PID S CMD WIDE-WCHAN-COLUMN
2367 S postgres: leds leds 164.154 SYSC_semtimedop
2900 S postgres: leds leds [local] SYSC_semtimedop
10422 S postgres: leds leds 164.154 SYSC_semtimedop
13545 S postgres: leds leds 164.154 SYSC_semtimedop
14600 S postgres: leds leds 164.154 SYSC_semtimedop
17054 S postgres: leds leds 164.154 SYSC_semtimedop
19351 S postgres: leds leds 164.154 SYSC_semtimedop
20370 S postgres: leds leds 164.154 SYSC_semtimedop
21047 S postgres: leds leds 164.154 SYSC_semtimedop
21167 S postgres: leds leds 164.154 SYSC_semtimedop
21364 S postgres: leds leds 164.154 SYSC_semtimedop
21386 S postgres: leds leds 164.154 SYSC_semtimedop
22220 S postgres: leds leds 164.154 SYSC_semtimedop
22378 S postgres: leds leds 164.154 SYSC_semtimedop
22547 S postgres: leds leds 164.154 SYSC_semtimedop
24513 R postgres: leds leds 164.154 -
24687 S postgres: leds leds 164.154 SYSC_semtimedop
24718 S postgres: autovacuum worker poll_schedule_timeout
26815 S postgres: leds leds 164.154 SYSC_semtimedop
31567 S postgres: leds leds 164.154 SYSC_semtimedop
31571 S postgres: leds leds 164.154 SYSC_semtimedop
31623 S postgres: leds leds 164.154 SYSC_semtimedop
31887 S postgres: leds leds 164.154 SYSC_semtimedop


Not sure if it will help in finding how this happened, but it has happened
twice, both times on relatively large databases (tens to hundreds of
gigabytes) with several hundred concurrent connections.  Both times it
happened while running a long-running script that updates each record in the
'addresses' table.  The table contains 1-5 million records, each ~1KB wide;
it references several other tables and is itself referenced by about a dozen
tables.

The script starts a transaction, runs 20 single-record UPDATEs in quick
succession, COMMITs, and then briefly sleeps.  So if nothing else, the txid
is going up fairly rapidly and the database is working reasonably hard.
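
A rough sketch of that batch pattern (the SET column, key values, and sleep
interval below are placeholders; only the 'addresses' table, the
20-per-transaction batching, and the pause come from the description above):

BEGIN;
-- 20 single-record UPDATEs in quick succession; column and key values here
-- are hypothetical
UPDATE addresses SET geocoded_at = NOW() WHERE addressesid = 1001;
UPDATE addresses SET geocoded_at = NOW() WHERE addressesid = 1002;
-- ... 18 more single-record UPDATEs ...
COMMIT;
SELECT pg_sleep(0.5);  -- brief pause before the next batch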

Hardware does not appear to be a factor, but both servers on which this
occurred are Dell rackmount servers with two modern Intel CPUs (~E5-2630
class) and 32-128GB of memory.

Again, stack traces will be attached.

Thank you.
-Charles
