BUG #13657: Some kind of undetected deadlock between query and "startup process" on replica. - Mailing list pgsql-bugs

From maxim.boguk@gmail.com
Subject BUG #13657: Some kind of undetected deadlock between query and "startup process" on replica.
Date
Msg-id 20151001121745.368.58359@wrigleys.postgresql.org
Whole thread Raw
Responses Re: BUG #13657: Some kind of undetected deadlock between query and "startup process" on replica.  (Maxim Boguk <maxim.boguk@gmail.com>)
List pgsql-bugs
The following bug has been logged on the website:

Bug reference:      13657
Logged by:          Maksym Boguk
Email address:      maxim.boguk@gmail.com
PostgreSQL version: 9.4.4
Operating system:   Linux
Description:

Hi,

On one of 9.4.4 replica database I had found strange issue with stuck query
and stuck wal replay process waiting each other but with no detectable
deadlock condtions.

Stuck query runs:
SELECT
        schemaname,
        relname,
        pg_relation_size(relid) as table_size,
        pg_indexes_size(relid) as table_indexes_size,
        seq_scan,
        coalesce(idx_scan,0) as idx_scan,
        seq_tup_read,
        coalesce(idx_tup_fetch,0) as idx_tup_fetch,
        n_tup_ins as inserted,
        n_tup_upd as updated,
        n_tup_del as deleted,
        n_tup_hot_upd as hot_updated
FROM
        pg_stat_user_tables

In pg_stat_activity it's on waiting state.
Locks for this query:
postgres=# select * from pg_locks where pid=862 order by granted, relation;
  locktype  | database | relation | page | tuple | virtualxid |
transactionid | classid | objid | objsubid | virtualtransaction | pid |
mode       | granted | fastpath

------------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+-----+-----------------+---------+----------
 relation   |    16424 |    17987 |      |       |            |
 |         |       |          | 7/924904           | 862 | AccessShareLock |
f       | f
 relation   |    16424 |     1259 |      |       |            |
 |         |       |          | 7/924904           | 862 | AccessShareLock |
t       | t
 relation   |    16424 |     2610 |      |       |            |
 |         |       |          | 7/924904           | 862 | AccessShareLock |
t       | t
 relation   |    16424 |     2615 |      |       |            |
 |         |       |          | 7/924904           | 862 | AccessShareLock |
t       | t
 relation   |    16424 |     2662 |      |       |            |
 |         |       |          | 7/924904           | 862 | AccessShareLock |
t       | t
 relation   |    16424 |     2663 |      |       |            |
 |         |       |          | 7/924904           | 862 | AccessShareLock |
t       | t
 relation   |    16424 |     2678 |      |       |            |
 |         |       |          | 7/924904           | 862 | AccessShareLock |
t       | t
 relation   |    16424 |     2679 |      |       |            |
 |         |       |          | 7/924904           | 862 | AccessShareLock |
t       | t
 relation   |    16424 |     2684 |      |       |            |
 |         |       |          | 7/924904           | 862 | AccessShareLock |
t       | t
 relation   |    16424 |     2685 |      |       |            |
 |         |       |          | 7/924904           | 862 | AccessShareLock |
t       | t
 relation   |    16424 |     3455 |      |       |            |
 |         |       |          | 7/924904           | 862 | AccessShareLock |
t       | t
 relation   |    16424 |    11221 |      |       |            |
 |         |       |          | 7/924904           | 862 | AccessShareLock |
t       | t
 relation   |    16424 |    11236 |      |       |            |
 |         |       |          | 7/924904           | 862 | AccessShareLock |
t       | t
 virtualxid |          |          |      |       | 7/924904   |
 |         |       |          | 7/924904           | 862 | ExclusiveLock   |
t       | t


So not granted an single lock on relation 17987 (just common table inside
the database).

Now lets see whats going on with wal replay:
ps -axuww | grep recover
postgres 26786  2.5  0.0 177620 76784 ?        Ss   Sep30  48:27 postgres:
startup process   recovering 00000001000061A0000000AE waiting
So wal replay waiting for something.

Lets see what locks wal replay have:
postgres=# select * from pg_locks where pid=26786 order by granted,
relation;
  locktype  | database | relation | page | tuple | virtualxid |
transactionid | classid | objid | objsubid | virtualtransaction |  pid  |
    mode         | granted | fastpath

------------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+-------+---------------------+---------+----------
 relation   |    16424 |    17987 |      |       |            |
 |         |       |          | 1/0                | 26786 |
AccessExclusiveLock | t       | f
 virtualxid |          |          |      |       | 1/1        |
 |         |       |          | 1/0                | 26786 | ExclusiveLock
    | t       | t
(2 rows)


So wal replay got AccessExclusiveLock on relation 17987 and waiting for
something.
And query waiting for AccessShareLock on the same relation.

Once I killed waiting query, wal replay resume with no issues.
So seems there are some king of undetected deadlock condition between query
and wal replay.

PS: situation reproduced every few hours so it's possible to debug but I
have no good idea where to start.

Kind Regards,
Maksym

pgsql-bugs by date:

Previous
From: swingi@gmail.com
Date:
Subject: BUG #13656: table inheritance, pg_dump emits same constraint for all inheritors causing errors
Next
From: Maxim Boguk
Date:
Subject: Re: BUG #13657: Some kind of undetected deadlock between query and "startup process" on replica.