Thread: [BUGS] Connections hang indefinitely while taking an LWLockTranche / buffer_content lock.

Hi there -

I'm encountering a locking issue in Postgres. I unfortunately haven't
been able to make a reproduction for it, but it's occurred a few times
on different machines, and I have access to two machines that are
currently locked up, so I'm hoping that someone can help anyway.

Postgres version: "PostgreSQL 9.6.5 on x86_64-pc-linux-gnu, compiled
by gcc (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4, 64-bit"

The issue is that, during an INSERT into an unlogged table, the
inserting connection freezes and never returns. When I query
pg_stat_activity I see that it is waiting on an "LWLockTranche" /
"buffer_content" lock. These connections can't be killed by
pg_terminate_backend() or pg_cancel_backend() (those functions return
true, but the entries in pg_stat_activity remain), and they persist
long after the client application has exited - some I found on a
staging server were a month old, and were only cleared by restarting
the Postgres instance. Their transactions are held open as well.
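
In case it's useful for spotting the same symptom elsewhere: a filter
along these lines on pg_stat_activity (9.6's wait_event columns; just a
sketch of the check I've been doing by eye) surfaces the stuck backends:

```
-- Backends waiting on a buffer_content lightweight lock, oldest
-- transaction first (wait_event_type/wait_event are 9.6 columns).
SELECT pid, state, now() - xact_start AS xact_age, query
FROM pg_stat_activity
WHERE wait_event_type = 'LWLockTranche'
  AND wait_event = 'buffer_content'
ORDER BY xact_start;
```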

Example contents of pg_stat_activity:

```
SELECT pid, wait_event_type, wait_event, state, backend_xid,
backend_xmin, backend_start, xact_start, query_start, state_change,
query FROM pg_stat_activity ORDER BY xact_start
```
```
6840;"LWLockTranche";"buffer_content";"active";11092;11092;"2017-10-18
14:53:01.224955+00";"2017-10-18 14:53:01.243026+00";"2017-10-18
14:53:01.243148+00";"2017-10-18 14:53:01.243149+00";"INSERT INTO
public.que_lockers ( pid, worker_count, worker_priorities, ruby_pid,
ruby_hostname, listening, queues ) VALUES ( pg_backend_pid(),
$1::integer, $2::integer[], $3::integer, $4::text, $5::boolean,
$6::text[] )"
30522;"LWLockTranche";"buffer_content";"active";11093;11092;"2017-10-19
15:21:24.692055+00";"2017-10-19 15:21:24.720159+00";"2017-10-19
15:21:24.720272+00";"2017-10-19 15:21:24.720273+00";"INSERT INTO
public.que_lockers ( pid, worker_count, worker_priorities, ruby_pid,
ruby_hostname, listening, queues ) VALUES ( pg_backend_pid(),
$1::integer, $2::integer[], $3::integer, $4::text, $5::boolean,
$6::text[] )"
628;"LWLockTranche";"buffer_content";"active";11094;11092;"2017-10-19
18:00:51.299233+00";"2017-10-19 18:00:51.329217+00";"2017-10-19
18:00:51.329333+00";"2017-10-19 18:00:51.329334+00";"INSERT INTO
public.que_lockers ( pid, worker_count, worker_priorities, ruby_pid,
ruby_hostname, listening, queues ) VALUES ( pg_backend_pid(),
$1::integer, $2::integer[], $3::integer, $4::text, $5::boolean,
$6::text[] )"
23871;"LWLockTranche";"buffer_content";"active";11095;11092;"2017-10-20
18:08:01.292776+00";"2017-10-20 18:08:01.32107+00";"2017-10-20
18:08:01.32119+00";"2017-10-20 18:08:01.321191+00";"INSERT INTO
public.que_lockers ( pid, worker_count, worker_priorities, ruby_pid,
ruby_hostname, listening, queues ) VALUES ( pg_backend_pid(),
$1::integer, $2::integer[], $3::integer, $4::text, $5::boolean,
$6::text[] )"
25825;"";"";"active";;11092;"2017-10-20
20:10:26.865863+00";"2017-10-20 20:15:33.682626+00";"2017-10-20
20:15:33.682626+00";"2017-10-20 20:15:33.682629+00";"SELECT pid,
wait_event_type, wait_event, state, backend_xid, backend_xmin,
backend_start, xact_start, query_start, state_change, query FROM
pg_stat_activity ORDER BY xact_start"
25823;"";"";"idle";;;"2017-10-20 20:10:17.70007+00";"";"2017-10-20
20:10:26.041268+00";"2017-10-20 20:10:26.042296+00";"SELECT 1 FROM
pg_available_extensions WHERE name='adminpack'"
23868;"";"";"idle";;;"2017-10-20 18:07:57.609124+00";"";"2017-10-20
18:08:01.241658+00";"2017-10-20 18:08:01.241927+00";"SELECT "id",
"pg_subdomain", "custom_domain" FROM "public"."companies""
```

The table schema looks like this (copy-pasted from PgAdmin 3):

```
-- Table: public.que_lockers

-- DROP TABLE public.que_lockers;

CREATE UNLOGGED TABLE public.que_lockers
(
  pid integer NOT NULL,
  worker_count integer NOT NULL,
  worker_priorities integer[] NOT NULL,
  ruby_pid integer NOT NULL,
  ruby_hostname text NOT NULL,
  queues text[] NOT NULL,
  listening boolean NOT NULL,
  CONSTRAINT que_lockers_pkey PRIMARY KEY (pid),
  CONSTRAINT valid_queues CHECK (array_ndims(queues) = 1 AND array_length(queues, 1) IS NOT NULL),
  CONSTRAINT valid_worker_priorities CHECK (array_ndims(worker_priorities) = 1 AND array_length(worker_priorities, 1) IS NOT NULL)
)
WITH (
  OIDS=FALSE
);
ALTER TABLE public.que_lockers OWNER TO u4eo9shu4pp52d;

-- Index: public.que_lockers_listening_queues_idx

-- DROP INDEX public.que_lockers_listening_queues_idx;

CREATE INDEX que_lockers_listening_queues_idx
  ON public.que_lockers
  USING gin (queues COLLATE pg_catalog."default")
  WHERE listening;
```

The inserted values don't appear in the pg_stat_activity output (the
statement is parameterized, beyond pid being pg_backend_pid()), but
for reference they would have been as follows (a filled-in version of
the INSERT is sketched after the list):
worker_count = 6
worker_priorities = '{10,30,50,NULL,NULL,NULL}'
ruby_pid = (the pid of the client process)
ruby_hostname = (the hostname of the client process, in this case a UUID)
listening = true
queues = '{default}'
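
For what it's worth, with the parameters substituted the statement would
look roughly like this (the ruby_pid and ruby_hostname values below are
placeholders, not the real ones):

```
-- Reconstruction of one of the stuck INSERTs with parameters filled in;
-- ruby_pid and ruby_hostname are placeholder values for illustration.
INSERT INTO public.que_lockers
  ( pid, worker_count, worker_priorities, ruby_pid, ruby_hostname, listening, queues )
VALUES
  ( pg_backend_pid(),
    6,
    '{10,30,50,NULL,NULL,NULL}'::integer[],
    12345,
    'example-client-hostname'::text,
    true,
    '{default}'::text[] );
```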

The following setup was done for each connection after they were established:
```
SET standard_conforming_strings = ON
SET client_min_messages = 'WARNING'
SET search_path = "public"
SET DateStyle = 'ISO'
SET idle_in_transaction_session_timeout TO '60s'
SET SESSION TIME ZONE 'UTC'
```

Unfortunately the databases I've been able to reproduce this on are
hosted by Heroku, so I don't have direct access to the machines for
debugging purposes. Someone in the #postgresql IRC channel suggested
checking what the backend processes were doing with strace - I asked
Heroku Support to check this, and they responded that the processes
were waiting on `semop` calls, but that according to `ipcs -s` there
weren't any currently active. If anyone has any more requests for
information to be retrieved from a machine with a stuck process like
this, I'm happy to relay them.

I hope this is enough to go on. If anyone would like any more
information please let me know and I'll do my best to investigate.

Chris



Chris Hanks <christopher.m.hanks@gmail.com> writes:
> I'm encountering a locking issue in Postgres. I unfortunately haven't
> been able to make a reproduction for it, but it's occurred a few times
> on different machines, and I have access to two machines that are
> currently locked up, so I'm hoping that someone can help anyway.

If you could get stack traces from the stuck backends, that would be
pretty helpful.
        regards, tom lane



Alright, thanks, I'll ask them to look into that first thing tomorrow
morning. They may not get back to me for a day or so, so if anyone has
anything else they'd like me to ask for in the same request, please
let me know!

Chris

On Sun, Oct 22, 2017 at 2:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Chris Hanks <christopher.m.hanks@gmail.com> writes:
>> I'm encountering a locking issue in Postgres. I unfortunately haven't
>> been able to make a reproduction for it, but it's occurred a few times
>> on different machines, and I have access to two machines that are
>> currently locked up, so I'm hoping that someone can help anyway.
>
> If you could get stack traces from the stuck backends, that would be
> pretty helpful.
>
>                         regards, tom lane



Hi -

So it seems that Heroku support can't currently attach to the LXC
container running the 9.6.5 DB. They were able to get stack traces
from another DB running 9.6.1 that's experiencing the same issue (I
had started up the 9.6.5 instance hoping that bumping the minor
version would fix this). That instance has two stuck processes.
pg_stat_activity looks like:

```
Query:

SELECT pid, wait_event_type, wait_event, state, backend_xid,
backend_xmin, backend_start, xact_start, query_start, state_change,
query FROM pg_stat_activity ORDER BY xact_start

Results:

22;"LWLockTranche";"buffer_content";"active";11091;11091;"2017-10-17
21:03:22.975601+00";"2017-10-17 21:03:23.001133+00";"2017-10-17
21:03:23.001253+00";"2017-10-17 21:03:23.001255+00";"INSERT INTO
public.que_lockers ( pid, worker_count, worker_priorities, ruby_pid,
ruby_hostname, listening, queues ) VALUES ( pg_backend_pid(),
$1::integer, $2::integer[], $3::integer, $4::text, $5::boolean,
$6::text[] )"
16987;"LWLockTranche";"buffer_content";"active";11092;11091;"2017-10-18
14:31:45.116391+00";"2017-10-18 14:31:45.133445+00";"2017-10-18
14:31:45.133585+00";"2017-10-18 14:31:45.133586+00";"INSERT INTO
public.que_lockers ( pid, worker_count, worker_priorities, ruby_pid,
ruby_hostname, listening, queues ) VALUES ( pg_backend_pid(),
$1::integer, $2::integer[], $3::integer, $4::text, $5::boolean,
$6::text[] )"
5502;"";"";"active";;11091;"2017-10-23 16:31:10.541227+00";"2017-10-23
16:31:15.909758+00";"2017-10-23 16:31:15.909758+00";"2017-10-23
16:31:15.909761+00";"SELECT pid, wait_event_type, wait_event, state,
backend_xid, backend_xmin, backend_start, xact_start, query_start,
state_change, query FROM pg_stat_activity ORDER BY xact_start"
20;"";"";"";;;"";"";"";"";"<insufficient privilege>"
5501;"";"";"idle";;;"2017-10-23 16:31:08.627896+00";"";"2017-10-23
16:31:09.089391+00";"2017-10-23 16:31:09.090431+00";"SELECT 1 FROM
pg_available_extensions WHERE name='adminpack'"
```

The currently held locks:
```
Query

SELECT db.datname, locktype, relation::regclass, mode, transactionid
AS tid, virtualtransaction AS vtid, pid, granted
FROM pg_catalog.pg_locks l
LEFT JOIN pg_catalog.pg_database db ON db.oid = l.database
WHERE NOT pid = pg_backend_pid();

Results
    datname     |   locktype    |             relation             |       mode       |  tid  | vtid  |  pid  | granted
----------------+---------------+----------------------------------+------------------+-------+-------+-------+---------
 dfkjlo7osvu4dh | relation      | que_lockers_listening_queues_idx | RowExclusiveLock |       | 8/642 | 16987 | t
 dfkjlo7osvu4dh | relation      | que_lockers_pkey                 | RowExclusiveLock |       | 8/642 | 16987 | t
 dfkjlo7osvu4dh | relation      | que_lockers                      | RowExclusiveLock |       | 8/642 | 16987 | t
                | virtualxid    |                                  | ExclusiveLock    |       | 8/642 | 16987 | t
 dfkjlo7osvu4dh | relation      | que_lockers_listening_queues_idx | RowExclusiveLock |       | 5/10  |    22 | t
 dfkjlo7osvu4dh | relation      | que_lockers_pkey                 | RowExclusiveLock |       | 5/10  |    22 | t
 dfkjlo7osvu4dh | relation      | que_lockers                      | RowExclusiveLock |       | 5/10  |    22 | t
                | virtualxid    |                                  | ExclusiveLock    |       | 5/10  |    22 | t
                | transactionid |                                  | ExclusiveLock    | 11092 | 8/642 | 16987 | t
                | transactionid |                                  | ExclusiveLock    | 11091 | 5/10  |    22 | t
(10 rows)
```
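
Judging by the granted column above, nothing is waiting at the
heavyweight-lock level, so the hang must be below that, in an LWLock
(which pg_locks doesn't cover). As a sanity check, a query like this
(just a sketch) should come back empty on the stuck instance:

```
-- No ungranted heavyweight locks: the waits are on buffer (LW) locks,
-- which don't show up in pg_locks at all.
SELECT pid, locktype, mode, relation::regclass
FROM pg_catalog.pg_locks
WHERE NOT granted;
```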

The stack traces are:

```
Backtrace of query with PID 22:

#0  0x00007fa23544ed87 in write_call_graph (fd=<optimized out>) at gmon.c:287
#1  write_gmon () at gmon.c:394
#2  0x00005639548bf72b in LWLockAcquire (lock=0xffff0001,
mode=LW_EXCLUSIVE) at
/build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/storage/lmgr/lwlock.c:1287
#3  0x00005639546707c1 in ginHeapTupleFastInsert
(ginstate=ginstate@entry=0x7ffef442af00,
collector=collector@entry=0x7ffef442aee0)   at
/build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/access/gin/ginfast.c:314
#4  0x0000563954663985 in gininsert (index=<optimized out>,
values=0x7ffef442d590, isnull=0x7ffef442d690 "",
ht_ctid=0x5639555dfb64, heapRel=<optimized out>,
checkUnique=<optimized out>)   at
/build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/access/gin/gininsert.c:512
#5  0x00005639547b202c in ExecInsertIndexTuples
(slot=slot@entry=0x5639555dc3e0, tupleid=tupleid@entry=0x5639555dfb64,
estate=estate@entry=0x5639555dbb40, noDupErr=noDupErr@entry=0 '\000',   specConflict=specConflict@entry=0x0,
arbiterIndexes=arbiterIndexes@entry=0x0) at
/build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/executor/execIndexing.c:388
#6  0x00005639547d0ffd in ExecInsert (canSetTag=1 '\001',
estate=0x5639555dbb40, onconflict=<optimized out>, arbiterIndexes=0x0,
planSlot=0x5639555dc3e0, slot=0x5639555dc3e0,   mtstate=0x5639555dbde0) at
/build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/executor/nodeModifyTable.c:496
#7  ExecModifyTable (node=node@entry=0x5639555dbde0) at
/build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/executor/nodeModifyTable.c:1511
#8  0x00005639547b6c38 in ExecProcNode
(node=node@entry=0x5639555dbde0) at
/build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/executor/execProcnode.c:396
#9  0x00005639547b2cfe in ExecutePlan (dest=0x563954e0a300
<donothingDR>, direction=<optimized out>, numberTuples=0, sendTuples=0
'\000', operation=CMD_INSERT,   use_parallel_mode=<optimized out>, planstate=0x5639555dbde0,
estate=0x5639555dbb40) at
/build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/executor/execMain.c:1567
#10 standard_ExecutorRun (queryDesc=0x5639555cf6a0,
direction=<optimized out>, count=0) at
/build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/executor/execMain.c:338
#11 0x00007fa22a3b1035 in pgss_ExecutorRun (queryDesc=0x5639555cf6a0,
direction=ForwardScanDirection, count=0)   at
/build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../contrib/pg_stat_statements/pg_stat_statements.c:877
#12 0x00005639548d0549 in ProcessQuery (plan=<optimized out>,   sourceText=0x5639555c6ca0 "INSERT INTO
public.que_lockers( pid,
 
worker_count, worker_priorities, ruby_pid, ruby_hostname, listening,
queues ) VALUES ( pg_backend_pid(), $1::integer, $2::integer[],
$3::integer, $4::text, $5::bo"..., params=0x5639555c6db0,
dest=0x563954e0a300 <donothingDR>, completionTag=0x7ffef442dee0 "")   at
/build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/tcop/pquery.c:187
#13 0x00005639548d079e in PortalRunMulti
(portal=portal@entry=0x5639555c0b90, isTopLevel=isTopLevel@entry=1
'\001', setHoldSnapshot=setHoldSnapshot@entry=0 '\000',   dest=0x563954e0a300 <donothingDR>,
dest@entry=0x5639555a5310,
altdest=0x563954e0a300 <donothingDR>, altdest@entry=0x5639555a5310,
completionTag=completionTag@entry=0x7ffef442dee0 "")   at
/build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/tcop/pquery.c:1303
#14 0x00005639548d1397 in PortalRun
(portal=portal@entry=0x5639555c0b90,
count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1
'\001', dest=dest@entry=0x5639555a5310,   altdest=altdest@entry=0x5639555a5310,
completionTag=completionTag@entry=0x7ffef442dee0 "") at
/build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/tcop/pquery.c:815
#15 0x00005639548ce643 in exec_execute_message
(max_rows=9223372036854775807, portal_name=0x5639555a4f00 "")   at
/build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/tcop/postgres.c:1977
#16 PostgresMain (argc=<optimized out>,
argv=argv@entry=0x563955527678, dbname=0x5639555275a0
"dfkjlo7osvu4dh", username=<optimized out>)   at
/build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/tcop/postgres.c:4133
#17 0x00005639546543fb in BackendRun (port=0x56395551dcb0) at
/build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/postmaster/postmaster.c:4270
#18 BackendStartup (port=0x56395551dcb0) at
/build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/postmaster/postmaster.c:3944
#19 ServerLoop () at
/build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/postmaster/postmaster.c:1701
#20 0x000056395486e06b in PostmasterMain (argc=5, argv=<optimized
out>) at /build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/postmaster/postmaster.c:1309
#21 0x0000563954655142 in main (argc=5, argv=0x5639554d53a0) at
/build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/main/main.c:228

Backtrace of query with PID 16987

#0  0x00007fa23544ed87 in write_call_graph (fd=<optimized out>) at gmon.c:287
#1  write_gmon () at gmon.c:394
#2  0x00005639548bf72b in LWLockAcquire (lock=0xffff000f,
mode=LW_EXCLUSIVE) at
/build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/storage/lmgr/lwlock.c:1287
#3  0x00005639546704e0 in ginHeapTupleFastInsert
(ginstate=ginstate@entry=0x7ffef442af00,
collector=collector@entry=0x7ffef442aee0)   at
/build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/access/gin/ginfast.c:255
#4  0x0000563954663985 in gininsert (index=<optimized out>,
values=0x7ffef442d590, isnull=0x7ffef442d690 "",
ht_ctid=0x5639555df7e4, heapRel=<optimized out>,
checkUnique=<optimized out>)   at
/build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/access/gin/gininsert.c:512
#5  0x00005639547b202c in ExecInsertIndexTuples
(slot=slot@entry=0x5639555dc060, tupleid=tupleid@entry=0x5639555df7e4,
estate=estate@entry=0x5639555db7c0, noDupErr=noDupErr@entry=0 '\000',   specConflict=specConflict@entry=0x0,
arbiterIndexes=arbiterIndexes@entry=0x0) at
/build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/executor/execIndexing.c:388
#6  0x00005639547d0ffd in ExecInsert (canSetTag=1 '\001',
estate=0x5639555db7c0, onconflict=<optimized out>, arbiterIndexes=0x0,
planSlot=0x5639555dc060, slot=0x5639555dc060,   mtstate=0x5639555dba60) at
/build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/executor/nodeModifyTable.c:496
#7  ExecModifyTable (node=node@entry=0x5639555dba60) at
/build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/executor/nodeModifyTable.c:1511
#8  0x00005639547b6c38 in ExecProcNode
(node=node@entry=0x5639555dba60) at
/build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/executor/execProcnode.c:396
#9  0x00005639547b2cfe in ExecutePlan (dest=0x563954e0a300
<donothingDR>, direction=<optimized out>, numberTuples=0, sendTuples=0
'\000', operation=CMD_INSERT,   use_parallel_mode=<optimized out>, planstate=0x5639555dba60,
estate=0x5639555db7c0) at
/build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/executor/execMain.c:1567
#10 standard_ExecutorRun (queryDesc=0x5639555cf320,
direction=<optimized out>, count=0) at
/build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/executor/execMain.c:338
#11 0x00007fa22a3b1035 in pgss_ExecutorRun (queryDesc=0x5639555cf320,
direction=ForwardScanDirection, count=0)   at
/build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../contrib/pg_stat_statements/pg_stat_statements.c:877
#12 0x00005639548d0549 in ProcessQuery (plan=<optimized out>,   sourceText=0x5639555c6920 "INSERT INTO
public.que_lockers( pid,
 
worker_count, worker_priorities, ruby_pid, ruby_hostname, listening,
queues ) VALUES ( pg_backend_pid(), $1::integer, $2::integer[],
$3::integer, $4::text, $5::bo"..., params=0x5639555c6a30,
dest=0x563954e0a300 <donothingDR>, completionTag=0x7ffef442dee0 "")   at
/build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/tcop/pquery.c:187
#13 0x00005639548d079e in PortalRunMulti
(portal=portal@entry=0x5639555c0810, isTopLevel=isTopLevel@entry=1
'\001', setHoldSnapshot=setHoldSnapshot@entry=0 '\000',   dest=0x563954e0a300 <donothingDR>,
dest@entry=0x5639555a4f90,
altdest=0x563954e0a300 <donothingDR>, altdest@entry=0x5639555a4f90,
completionTag=completionTag@entry=0x7ffef442dee0 "")   at
/build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/tcop/pquery.c:1303
#14 0x00005639548d1397 in PortalRun
(portal=portal@entry=0x5639555c0810,
count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1
'\001', dest=dest@entry=0x5639555a4f90,   altdest=altdest@entry=0x5639555a4f90,
completionTag=completionTag@entry=0x7ffef442dee0 "") at
/build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/tcop/pquery.c:815
#15 0x00005639548ce643 in exec_execute_message
(max_rows=9223372036854775807, portal_name=0x5639555a4b80 "")   at
/build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/tcop/postgres.c:1977
#16 PostgresMain (argc=<optimized out>,
argv=argv@entry=0x563955527678, dbname=0x5639555275a0
"dfkjlo7osvu4dh", username=<optimized out>)   at
/build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/tcop/postgres.c:4133
#17 0x00005639546543fb in BackendRun (port=0x56395551dcb0) at
/build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/postmaster/postmaster.c:4270
#18 BackendStartup (port=0x56395551dcb0) at
/build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/postmaster/postmaster.c:3944
#19 ServerLoop () at
/build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/postmaster/postmaster.c:1701
#20 0x000056395486e06b in PostmasterMain (argc=5, argv=<optimized
out>) at /build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/postmaster/postmaster.c:1309
#21 0x0000563954655142 in main (argc=5, argv=0x5639554d53a0) at
/build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/main/main.c:228
```

Heroku support isn't sure how to attach to the 9.6.5 DB without
restarting the container, which will mean losing the stuck processes.
We can certainly try to do that and wait for the issue to recur,
though, if we think the 9.6.1 -> 9.6.5 difference is relevant.

Thanks!
Chris


On Sun, Oct 22, 2017 at 3:23 PM, Chris Hanks
<christopher.m.hanks@gmail.com> wrote:
> Alright, thanks, I'll ask them to look into that first thing tomorrow
> morning. They may not get back to me for a day or so, so if anyone has
> anything else they'd like me to ask for in the same request, please
> let me know!
>
> Chris
>
> On Sun, Oct 22, 2017 at 2:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Chris Hanks <christopher.m.hanks@gmail.com> writes:
>>> I'm encountering a locking issue in Postgres. I unfortunately haven't
>>> been able to make a reproduction for it, but it's occurred a few times
>>> on different machines, and I have access to two machines that are
>>> currently locked up, so I'm hoping that someone can help anyway.
>>
>> If you could get stack traces from the stuck backends, that would be
>> pretty helpful.
>>
>>                         regards, tom lane



Chris Hanks <christopher.m.hanks@gmail.com> writes:
> Backtrace of query with PID 22:

> #0  0x00007fa23544ed87 in write_call_graph (fd=<optimized out>) at gmon.c:287
> #1  write_gmon () at gmon.c:394
> #2  0x00005639548bf72b in LWLockAcquire (lock=0xffff0001,
> mode=LW_EXCLUSIVE) at
> /build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/storage/lmgr/lwlock.c:1287
> #3  0x00005639546707c1 in ginHeapTupleFastInsert
> (ginstate=ginstate@entry=0x7ffef442af00,
> collector=collector@entry=0x7ffef442aee0)
>     at /build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/access/gin/ginfast.c:314

> Backtrace of query with PID 16987

> #0  0x00007fa23544ed87 in write_call_graph (fd=<optimized out>) at gmon.c:287
> #1  write_gmon () at gmon.c:394
> #2  0x00005639548bf72b in LWLockAcquire (lock=0xffff000f,
> mode=LW_EXCLUSIVE) at
> /build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/storage/lmgr/lwlock.c:1287
> #3  0x00005639546704e0 in ginHeapTupleFastInsert
> (ginstate=ginstate@entry=0x7ffef442af00,
> collector=collector@entry=0x7ffef442aee0)


So it looks like ginHeapTupleFastInsert is doing something that carries a
risk of deadlocking on two or more LWLocks if multiple processes try to
insert index entries concurrently.  There's no deadlock detection in the
LWLock infrastructure --- that's one of the things that makes them
"lightweight" locks --- so callers have to avoid such coding patterns.

This is probably enough info for us to identify the exact cause.  But
I don't have time to look right this minute, and am not the hacker most
familiar with the GIN infrastructure anyway.  Anyone want to look for
the bug?
        regards, tom lane



On Mon, Oct 23, 2017 at 9:49 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> This is probably enough info for us to identify the exact cause.  But
> I don't have time to look right this minute, and am not the hacker most
> familiar with the GIN infrastructure anyway.  Anyone want to look for
> the bug?

I don't know very much about GIN, but this does look interesting. I'll
try to get to it later in the week; I have meetings over the next
couple of days.

-- 
Peter Geoghegan



I'm not sure if you guys had noticed this already, but just in case,
those two hung connections weren't making the inserts at exactly the
same time - if you look at pg_stat_activity they executed about a day
apart (since Heroku cycles the clients every 24 hours or so). And
before I restarted the first DB that experienced the problem, there
were ~30 backends built up over the course of a month. It seems like
when one INSERT sticks, every following INSERT just stacks up on top
of it, trying to take out the same lock.

This isn't to say that a race condition didn't cause the issue in the
first place (it may have, given how difficult it is to reproduce). But
it might have been a read using the index or something like that. From
my knowledge of the usage patterns for the app, I believe that only a
single INSERT should have been taking place at that time - the INSERT
only happens when a worker process starts up, there was only one
worker process for the app, and the worker processes were only
restarted every 24 hours. Also I'd expect this table to be empty when
the INSERT took place (it's an unlogged table that tracks which
connections belong to clients that are ready to pull from a queue).

Also, the only time the index would be used, to my knowledge, is in a
trigger on another table when a row is inserted. It tries to
pseudorandomly pick a client that's ready for work:

```
SELECT pid
INTO locker_pid
FROM (
  SELECT *, last_value(row_number) OVER () + 1 AS count
  FROM (
    SELECT *, row_number() OVER () - 1 AS row_number
    FROM (
      SELECT *
      FROM public.que_lockers ql, generate_series(1, ql.worker_count) AS id
      WHERE listening AND queues @> ARRAY[NEW.queue]
      ORDER BY md5(pid::text || id::text)
    ) t1
  ) t2
) t3
WHERE NEW.id % count = row_number;
```

Chris

On Mon, Oct 23, 2017 at 12:55 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Mon, Oct 23, 2017 at 9:49 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> This is probably enough info for us to identify the exact cause.  But
>> I don't have time to look right this minute, and am not the hacker most
>> familiar with the GIN infrastructure anyway.  Anyone want to look for
>> the bug?
>
> I don't know very much about GIN, but this does look interesting. I'll
> try to get to it later in the week; I have meetings over the next
> couple of days.
>
> --
> Peter Geoghegan



I managed to get a couple of hours to look at this this afternoon.

On Mon, Oct 23, 2017 at 10:39 AM, Chris Hanks
<christopher.m.hanks@gmail.com> wrote:
> I'm not sure if you guys had noticed this already, but just in case,
> those two hung connections weren't making the inserts at exactly the
> same time - if you look at pg_stat_activity they executed about a day
> apart (since Heroku cycles the clients every 24 hours or so). And
> before I restarted the first DB that experienced the problem, there
> were ~30 backends built up over the course of a month. It seems like
> when one INSERT sticks, every following INSERT just stacks up on top
> of it, trying to take out the same lock.

Right. In both backtraces, we see that we're an inserter stuck on
getting an exclusive buffer lock on the buffer containing block 0, the
meta page block (GIN_METAPAGE_BLKNO). There is probably some
session/backend that has acquired two buffer locks in an order that is
inconsistent with these inserters, meaning that you get an
undetectable deadlock. (The only alternative to that theory is that
some backend sits on a meta page buffer lock for some other reason,
but that seems much less likely.)

The interesting question to my mind is: What backend is the other
backend that acquires buffer locks in an incompatible order, resulting
in this undetectable deadlock? What's it doing? (I worked for Heroku
until quite recently; I may be able to open a back channel, with
Chris' permission.)

I remember expressing befuddlement about commit e95680832854c over a
year ago, and never satisfying myself that it was correct [1]. I'm no
expert on GIN, so I dropped it. It feels like that might be relevant
here, since you seem to be using this GIN index with a queue table.
That usage pattern is one where entire leaf pages in indexes tend to
be routinely deleted and later recycled by VACUUM, at least with
B-Trees [2]. Whereas, in general I think B-Tree (and presumably GIN)
page deletion is fairly rare, since the entire page must be empty for
it to happen.

The follow-up bugfix commit, e2c79e14, added a ConditionalLockPage()
to the insert ginInsertCleanup() path, while also adding a LockPage()
to the VACUUM path. In case you missed it, those are *heavyweight*
page lock acquisitions, not buffer lock acquisitions, which is pretty
unconventional (I thought only hash still did that). Frankly, the
e2c79e14 fix seems kind of bolted on (though I don't want to focus on
that aspect right now).

[1] https://postgr.es/m/CAM3SWZSDxqDBvUGOoNm0veVOwgJV3GDvoncYr6f5L16qo8MYRg@mail.gmail.com
[2] https://brandur.org/postgres-queues
-- 
Peter Geoghegan



I wound up working around the issue by forking the database and
removing that GIN index, and things have been fine with the new
instance for the past two days.
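
(For the record, if dropping the index hadn't been an option: from what
I can tell from the ginHeapTupleFastInsert frames in the backtraces,
turning off the GIN index's "fast update" pending list might also have
avoided the code path in question. I haven't tried it, so this is only
a sketch, and as I understand it the tradeoff is that every INSERT then
pays to update the index structure directly:)

```
-- Untried mitigation sketch: have inserts update the GIN index directly
-- instead of going through the pending list (ginHeapTupleFastInsert).
ALTER INDEX public.que_lockers_listening_queues_idx SET (fastupdate = off);
-- Flush anything already sitting in the pending list (function is new in 9.6):
SELECT gin_clean_pending_list('public.que_lockers_listening_queues_idx'::regclass);
```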

I previously had two Postgres instances with hung processes, one 9.6.1
and one 9.6.5. For work reasons I destroyed the 9.6.5 (Heroku support
was having issues with its container anyway) but I've kept the 9.6.1.
I'm happy to give permission for you to access it, just let me know
who to talk to, or have them reach out to me. I have an open issue in
their tracker under my work email, chris@pathgather.com.

Thanks again!

On Thu, Oct 26, 2017 at 7:14 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> I managed to get a couple of hours to look at this this afternoon.
>
> On Mon, Oct 23, 2017 at 10:39 AM, Chris Hanks
> <christopher.m.hanks@gmail.com> wrote:
>> I'm not sure if you guys had noticed this already, but just in case,
>> those two hung connections weren't making the inserts at exactly the
>> same time - if you look at pg_stat_activity they executed about a day
>> apart (since Heroku cycles the clients every 24 hours or so). And
>> before I restarted the first DB that experienced the problem, there
>> were ~30 backends built up over the course of a month. It seems like
>> when one INSERT sticks, every following INSERT just stacks up on top
>> of it, trying to take out the same lock.
>
> Right. In both backtraces, we see that we're an inserter stuck on
> getting an exclusive buffer lock on the buffer containing block 0, the
> meta page block (GIN_METAPAGE_BLKNO). There is probably some
> session/backend that has acquired two buffer locks in an order that is
> inconsistent with these inserters, meaning that you get an
> undetectable deadlock. (The only alternative to that theory is that
> some backend sits on a meta page buffer lock for some other reason,
> but that seems much less likely.)
>
> The interesting question to my mind is: What backend is the other
> backend that acquires buffer locks in an incompatible order, resulting
> in this undetectable deadlock? What's it doing? (I worked for Heroku
> until quite recently; I may be able to open a back channel, with
> Chris' permission.)
>
> I remember expressing befuddlement about commit e95680832854c over a
> year ago, and never satisfying myself that it was correct [1]. I'm no
> expert on GIN, so I dropped it. It feels like that might be relevant
> here, since you seem to be using this GIN index with a queue table.
> That usage pattern is one where entire leaf pages in indexes tend to
> be routinely deleted and later recycled by VACUUM, at least with
> B-Trees [2]. Whereas, in general I think B-Tree (and presumably GIN)
> page deletion is fairly rare, since the entire page must be empty for
> it to happen.
>
> The follow-up bugfix commit, e2c79e14, added a ConditionalLockPage()
> to the insert ginInsertCleanup() path, while also adding a LockPage()
> to the VACUUM path. In case you missed it, those are *heavyweight*
> page lock acquisitions, not buffer lock acquisitions, which is pretty
> unconventional (I thought only hash still did that). Frankly, the
> e2c79e14 fix seems kind of bolted on (though I don't want to focus on
> that aspect right now).
>
> [1] https://postgr.es/m/CAM3SWZSDxqDBvUGOoNm0veVOwgJV3GDvoncYr6f5L16qo8MYRg@mail.gmail.com
> [2] https://brandur.org/postgres-queues
> --
> Peter Geoghegan



Hey, I realize it's a bit late, but did anybody ever find a root cause
for this? I removed the relevant index and haven't had an issue since.

Thanks,
Chris

On Thu, Oct 26, 2017 at 7:42 PM, Chris Hanks
<christopher.m.hanks@gmail.com> wrote:
> I wound up working around the issue by forking the database and
> removing that GIN index, and things have been fine with the new
> instance for the past two days.
>
> I previously had two Postgres instances with hung processes, one 9.6.1
> and one 9.6.5. For work reasons I destroyed the 9.6.5 (Heroku support
> was having issues with its container anyway) but I've kept the 9.6.1.
> I'm happy to give permission for you to access it, just let me know
> who to talk to, or have them reach out to me. I have an open issue in
> their tracker under my work email, chris@pathgather.com.
>
> Thanks again!
>
> On Thu, Oct 26, 2017 at 7:14 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> I managed to get a couple of hours to look at this this afternoon.
>>
>> On Mon, Oct 23, 2017 at 10:39 AM, Chris Hanks
>> <christopher.m.hanks@gmail.com> wrote:
>>> I'm not sure if you guys had noticed this already, but just in case,
>>> those two hung connections weren't making the inserts at exactly the
>>> same time - if you look at pg_stat_activity they executed about a day
>>> apart (since Heroku cycles the clients every 24 hours or so). And
>>> before I restarted the first DB that experienced the problem, there
>>> were ~30 backends built up over the course of a month. It seems like
>>> when one INSERT sticks, every following INSERT just stacks up on top
>>> of it, trying to take out the same lock.
>>
>> Right. In both backtraces, we see that we're an inserter stuck on
>> getting an exclusive buffer lock on the buffer containing block 0, the
>> meta page block (GIN_METAPAGE_BLKNO). There is probably some
>> session/backend that has acquired two buffer locks in an order that is
>> inconsistent with these inserters, meaning that you get an
>> undetectable deadlock. (The only alternative to that theory is that
>> some backend sits on a meta page buffer lock for some other reason,
>> but that seems much less likely.)
>>
>> The interesting question to my mind is: What backend is the other
>> backend that acquires buffer locks in an incompatible order, resulting
>> in this undetectable deadlock? What's it doing? (I worked for Heroku
>> until quite recently; I may be able to open a back channel, with
>> Chris' permission.)
>>
>> I remember expressing befuddlement about commit e95680832854c over a
>> year ago, and never satisfying myself that it was correct [1]. I'm no
>> expert on GIN, so I dropped it. It feels like that might be relevant
>> here, since you seem to be using this GIN index with a queue table.
>> That usage pattern is one where entire leaf pages in indexes tend to
>> be routinely deleted and later recycled by VACUUM, at least with
>> B-Trees [2]. Whereas, in general I think B-Tree (and presumably GIN)
>> page deletion is fairly rare, since the entire page must be empty for
>> it to happen.
>>
>> The follow-up bugfix commit, e2c79e14, added a ConditionalLockPage()
>> to the insert ginInsertCleanup() path, while also adding a LockPage()
>> to the VACUUM path. In case you missed it, those are *heavyweight*
>> page lock acquisitions, not buffer lock acquisitions, which is pretty
>> unconventional (I thought only hash still did that). Frankly, the
>> e2c79e14 fix seems kind of bolted on (though I don't want to focus on
>> that aspect right now).
>>
>> [1] https://postgr.es/m/CAM3SWZSDxqDBvUGOoNm0veVOwgJV3GDvoncYr6f5L16qo8MYRg@mail.gmail.com
>> [2] https://brandur.org/postgres-queues
>> --
>> Peter Geoghegan