Thread: [BUGS] Connections hang indefinitely while taking a LWLockTranche buffer_content lock.
[BUGS] Connections hang indefinitely while taking a LWLockTranche buffer_content lock.
From
Chris Hanks
Date:
Hi there - I'm encountering a locking issue in Postgres. I unfortunately haven't been able to make a reproduction for it, but it's occurred a few times on different machines, and I have access to two machines that are currently locked up, so I'm hoping that someone can help anyway.

Postgres version: "PostgreSQL 9.6.5 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4, 64-bit"

The issue is that, during an INSERT to an unlogged table, the inserting connection gets frozen and never exits. When I query pg_stat_activity I see that it is holding open a "LWLockTranche" "buffer_content" lock. These connections can't be killed by pg_terminate_backend() or pg_cancel_backend() (those functions return true but the entries in pg_stat_activity remain), and they persist long after the client application has exited - there were some I found on a staging server that were a month old, and were only resolved by restarting the Postgres instance. Their transactions are held open also.

Example contents of pg_stat_activity:

```
SELECT pid, wait_event_type, wait_event, state, backend_xid, backend_xmin, backend_start, xact_start, query_start, state_change, query
FROM pg_stat_activity
ORDER BY xact_start
```

```
6840;"LWLockTranche";"buffer_content";"active";11092;11092;"2017-10-18 14:53:01.224955+00";"2017-10-18 14:53:01.243026+00";"2017-10-18 14:53:01.243148+00";"2017-10-18 14:53:01.243149+00";"INSERT INTO public.que_lockers ( pid, worker_count, worker_priorities, ruby_pid, ruby_hostname, listening, queues ) VALUES ( pg_backend_pid(), $1::integer, $2::integer[], $3::integer, $4::text, $5::boolean, $6::text[] )"
30522;"LWLockTranche";"buffer_content";"active";11093;11092;"2017-10-19 15:21:24.692055+00";"2017-10-19 15:21:24.720159+00";"2017-10-19 15:21:24.720272+00";"2017-10-19 15:21:24.720273+00";"INSERT INTO public.que_lockers ( pid, worker_count, worker_priorities, ruby_pid, ruby_hostname, listening, queues ) VALUES ( pg_backend_pid(), $1::integer, $2::integer[], $3::integer, $4::text, $5::boolean, $6::text[] )"
628;"LWLockTranche";"buffer_content";"active";11094;11092;"2017-10-19 18:00:51.299233+00";"2017-10-19 18:00:51.329217+00";"2017-10-19 18:00:51.329333+00";"2017-10-19 18:00:51.329334+00";"INSERT INTO public.que_lockers ( pid, worker_count, worker_priorities, ruby_pid, ruby_hostname, listening, queues ) VALUES ( pg_backend_pid(), $1::integer, $2::integer[], $3::integer, $4::text, $5::boolean, $6::text[] )"
23871;"LWLockTranche";"buffer_content";"active";11095;11092;"2017-10-20 18:08:01.292776+00";"2017-10-20 18:08:01.32107+00";"2017-10-20 18:08:01.32119+00";"2017-10-20 18:08:01.321191+00";"INSERT INTO public.que_lockers ( pid, worker_count, worker_priorities, ruby_pid, ruby_hostname, listening, queues ) VALUES ( pg_backend_pid(), $1::integer, $2::integer[], $3::integer, $4::text, $5::boolean, $6::text[] )"
25825;"";"";"active";;11092;"2017-10-20 20:10:26.865863+00";"2017-10-20 20:15:33.682626+00";"2017-10-20 20:15:33.682626+00";"2017-10-20 20:15:33.682629+00";"SELECT pid, wait_event_type, wait_event, state, backend_xid, backend_xmin, backend_start, xact_start, query_start, state_change, query FROM pg_stat_activity ORDER BY xact_start"
25823;"";"";"idle";;;"2017-10-20 20:10:17.70007+00";"";"2017-10-20 20:10:26.041268+00";"2017-10-20 20:10:26.042296+00";"SELECT 1 FROM pg_available_extensions WHERE name='adminpack'"
23868;"";"";"idle";;;"2017-10-20 18:07:57.609124+00";"";"2017-10-20 18:08:01.241658+00";"2017-10-20 18:08:01.241927+00";"SELECT "id", "pg_subdomain", "custom_domain" FROM "public"."companies""
```

The table schema looks like this (copy-pasted from PgAdmin 3):

```
-- Table: public.que_lockers

-- DROP TABLE public.que_lockers;

CREATE UNLOGGED TABLE public.que_lockers
(
  pid integer NOT NULL,
  worker_count integer NOT NULL,
  worker_priorities integer[] NOT NULL,
  ruby_pid integer NOT NULL,
  ruby_hostname text NOT NULL,
  queues text[] NOT NULL,
  listening boolean NOT NULL,
  CONSTRAINT que_lockers_pkey PRIMARY KEY (pid),
  CONSTRAINT valid_queues CHECK (array_ndims(queues) = 1 AND array_length(queues, 1) IS NOT NULL),
  CONSTRAINT valid_worker_priorities CHECK (array_ndims(worker_priorities) = 1 AND array_length(worker_priorities, 1) IS NOT NULL)
)
WITH (
  OIDS=FALSE
);
ALTER TABLE public.que_lockers
  OWNER TO u4eo9shu4pp52d;

-- Index: public.que_lockers_listening_queues_idx

-- DROP INDEX public.que_lockers_listening_queues_idx;

CREATE INDEX que_lockers_listening_queues_idx
  ON public.que_lockers
  USING gin
  (queues COLLATE pg_catalog."default")
  WHERE listening;
```

The inserted values don't appear in the pg_stat_activity output (beyond pid being pg_backend_pid()), but they should be:

worker_count = 6
worker_priorities = '{10,30,50,NULL,NULL,NULL}'
ruby_pid = (the pid of the client process)
ruby_hostname = (the hostname of the client process, in this case a UUID)
listening = true
queues = '{default}'

The following setup was done for each connection after it was established:

```
SET standard_conforming_strings = ON
SET client_min_messages = 'WARNING'
SET search_path = "public"
SET DateStyle = 'ISO'
SET idle_in_transaction_session_timeout TO '60s'
SET SESSION TIME ZONE 'UTC'
```

Unfortunately the databases I've been able to reproduce this on are hosted by Heroku, so I don't have direct access to the machines for debugging purposes. Someone in the #postgresql IRC channel suggested checking what the backend processes were doing with strace - I asked Heroku Support to check this and they responded that the processes were waiting on `semop` calls, but that according to `ipcs -s` there aren't any currently active.

If anyone has any more requests for information to be retrieved from a machine with a stuck process like this, I'm happy to relay them. I hope this is enough to go on. If anyone would like any more information please let me know and I'll do my best to investigate.

Chris
Re: [BUGS] Connections hang indefinitely while taking a LWLockTranche buffer_content lock.
From
Tom Lane
Date:
Chris Hanks <christopher.m.hanks@gmail.com> writes:
> I'm encountering a locking issue in Postgres. I unfortunately haven't
> been able to make a reproduction for it, but it's occurred a few times
> on different machines, and I have access to two machines that are
> currently locked up, so I'm hoping that someone can help anyway.

If you could get stack traces from the stuck backends, that would be pretty helpful.

			regards, tom lane
Re: [BUGS] Connections hang indefinitely while taking a LWLockTranche buffer_content lock.
From
Chris Hanks
Date:
Alright, thanks, I'll ask them to look into that first thing tomorrow morning. They may not get back to me for a day or so, so if anyone has anything else they'd like me to ask for in the same request, please let me know!

Chris

On Sun, Oct 22, 2017 at 2:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Chris Hanks <christopher.m.hanks@gmail.com> writes:
>> I'm encountering a locking issue in Postgres. I unfortunately haven't
>> been able to make a reproduction for it, but it's occurred a few times
>> on different machines, and I have access to two machines that are
>> currently locked up, so I'm hoping that someone can help anyway.
>
> If you could get stack traces from the stuck backends, that would be
> pretty helpful.
>
>			regards, tom lane
Re: [BUGS] Connections hang indefinitely while taking a LWLockTranche buffer_content lock.
From
Chris Hanks
Date:
Hi - So it seems that Heroku support can't currently attach to the lxc container running the 9.6.5 DB. They were able to get a stack trace from another DB experiencing the same issue that's running 9.6.1 (I had started up the 9.6.5 instance hoping that bumping the minor version would fix this). This instance has two stuck processes.

pg_stat_activity looks like:

```
Query:

SELECT pid, wait_event_type, wait_event, state, backend_xid, backend_xmin, backend_start, xact_start, query_start, state_change, query
FROM pg_stat_activity
ORDER BY xact_start

Results:

22;"LWLockTranche";"buffer_content";"active";11091;11091;"2017-10-17 21:03:22.975601+00";"2017-10-17 21:03:23.001133+00";"2017-10-17 21:03:23.001253+00";"2017-10-17 21:03:23.001255+00";"INSERT INTO public.que_lockers ( pid, worker_count, worker_priorities, ruby_pid, ruby_hostname, listening, queues ) VALUES ( pg_backend_pid(), $1::integer, $2::integer[], $3::integer, $4::text, $5::boolean, $6::text[] )"
16987;"LWLockTranche";"buffer_content";"active";11092;11091;"2017-10-18 14:31:45.116391+00";"2017-10-18 14:31:45.133445+00";"2017-10-18 14:31:45.133585+00";"2017-10-18 14:31:45.133586+00";"INSERT INTO public.que_lockers ( pid, worker_count, worker_priorities, ruby_pid, ruby_hostname, listening, queues ) VALUES ( pg_backend_pid(), $1::integer, $2::integer[], $3::integer, $4::text, $5::boolean, $6::text[] )"
5502;"";"";"active";;11091;"2017-10-23 16:31:10.541227+00";"2017-10-23 16:31:15.909758+00";"2017-10-23 16:31:15.909758+00";"2017-10-23 16:31:15.909761+00";"SELECT pid, wait_event_type, wait_event, state, backend_xid, backend_xmin, backend_start, xact_start, query_start, state_change, query FROM pg_stat_activity ORDER BY xact_start"
20;"";"";"";;;"";"";"";"";"<insufficient privilege>"
5501;"";"";"idle";;;"2017-10-23 16:31:08.627896+00";"";"2017-10-23 16:31:09.089391+00";"2017-10-23 16:31:09.090431+00";"SELECT 1 FROM pg_available_extensions WHERE name='adminpack'"
```

The currently held locks:

```
Query:

SELECT db.datname, locktype, relation::regclass, mode, transactionid AS tid, virtualtransaction AS vtid, pid, granted
FROM pg_catalog.pg_locks l
LEFT JOIN pg_catalog.pg_database db ON db.oid = l.database
WHERE NOT pid = pg_backend_pid();

Results:

    datname     |   locktype    |             relation             |       mode       |  tid  | vtid  |  pid  | granted
----------------+---------------+----------------------------------+------------------+-------+-------+-------+---------
 dfkjlo7osvu4dh | relation      | que_lockers_listening_queues_idx | RowExclusiveLock |       | 8/642 | 16987 | t
 dfkjlo7osvu4dh | relation      | que_lockers_pkey                 | RowExclusiveLock |       | 8/642 | 16987 | t
 dfkjlo7osvu4dh | relation      | que_lockers                      | RowExclusiveLock |       | 8/642 | 16987 | t
                | virtualxid    |                                  | ExclusiveLock    |       | 8/642 | 16987 | t
 dfkjlo7osvu4dh | relation      | que_lockers_listening_queues_idx | RowExclusiveLock |       | 5/10  |    22 | t
 dfkjlo7osvu4dh | relation      | que_lockers_pkey                 | RowExclusiveLock |       | 5/10  |    22 | t
 dfkjlo7osvu4dh | relation      | que_lockers                      | RowExclusiveLock |       | 5/10  |    22 | t
                | virtualxid    |                                  | ExclusiveLock    |       | 5/10  |    22 | t
                | transactionid |                                  | ExclusiveLock    | 11092 | 8/642 | 16987 | t
                | transactionid |                                  | ExclusiveLock    | 11091 | 5/10  |    22 | t
(10 rows)
```

The stack traces are:

```
Backtrace of query with PID 22:

#0 0x00007fa23544ed87 in write_call_graph (fd=<optimized out>) at gmon.c:287
#1 write_gmon () at gmon.c:394
#2 0x00005639548bf72b in LWLockAcquire (lock=0xffff0001, mode=LW_EXCLUSIVE) at /build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/storage/lmgr/lwlock.c:1287
#3 0x00005639546707c1 in ginHeapTupleFastInsert (ginstate=ginstate@entry=0x7ffef442af00, collector=collector@entry=0x7ffef442aee0) at /build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/access/gin/ginfast.c:314
#4 0x0000563954663985 in gininsert (index=<optimized out>, values=0x7ffef442d590, isnull=0x7ffef442d690 "", ht_ctid=0x5639555dfb64, heapRel=<optimized out>, checkUnique=<optimized out>) at /build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/access/gin/gininsert.c:512
#5 0x00005639547b202c in ExecInsertIndexTuples (slot=slot@entry=0x5639555dc3e0, tupleid=tupleid@entry=0x5639555dfb64, estate=estate@entry=0x5639555dbb40, noDupErr=noDupErr@entry=0 '\000', specConflict=specConflict@entry=0x0, arbiterIndexes=arbiterIndexes@entry=0x0) at /build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/executor/execIndexing.c:388
#6 0x00005639547d0ffd in ExecInsert (canSetTag=1 '\001', estate=0x5639555dbb40, onconflict=<optimized out>, arbiterIndexes=0x0, planSlot=0x5639555dc3e0, slot=0x5639555dc3e0, mtstate=0x5639555dbde0) at /build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/executor/nodeModifyTable.c:496
#7 ExecModifyTable (node=node@entry=0x5639555dbde0) at /build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/executor/nodeModifyTable.c:1511
#8 0x00005639547b6c38 in ExecProcNode (node=node@entry=0x5639555dbde0) at /build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/executor/execProcnode.c:396
#9 0x00005639547b2cfe in ExecutePlan (dest=0x563954e0a300 <donothingDR>, direction=<optimized out>, numberTuples=0, sendTuples=0 '\000', operation=CMD_INSERT, use_parallel_mode=<optimized out>, planstate=0x5639555dbde0, estate=0x5639555dbb40) at /build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/executor/execMain.c:1567
#10 standard_ExecutorRun (queryDesc=0x5639555cf6a0, direction=<optimized out>, count=0) at /build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/executor/execMain.c:338
#11 0x00007fa22a3b1035 in pgss_ExecutorRun (queryDesc=0x5639555cf6a0, direction=ForwardScanDirection, count=0) at /build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../contrib/pg_stat_statements/pg_stat_statements.c:877
#12 0x00005639548d0549 in ProcessQuery (plan=<optimized out>, sourceText=0x5639555c6ca0 "INSERT INTO public.que_lockers( pid, worker_count, worker_priorities, ruby_pid, ruby_hostname, listening, queues ) VALUES ( pg_backend_pid(), $1::integer, $2::integer[], $3::integer, $4::text, $5::bo"..., params=0x5639555c6db0, dest=0x563954e0a300 <donothingDR>, completionTag=0x7ffef442dee0 "") at /build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/tcop/pquery.c:187
#13 0x00005639548d079e in PortalRunMulti (portal=portal@entry=0x5639555c0b90, isTopLevel=isTopLevel@entry=1 '\001', setHoldSnapshot=setHoldSnapshot@entry=0 '\000', dest=0x563954e0a300 <donothingDR>, dest@entry=0x5639555a5310, altdest=0x563954e0a300 <donothingDR>, altdest@entry=0x5639555a5310, completionTag=completionTag@entry=0x7ffef442dee0 "") at /build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/tcop/pquery.c:1303
#14 0x00005639548d1397 in PortalRun (portal=portal@entry=0x5639555c0b90, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1 '\001', dest=dest@entry=0x5639555a5310, altdest=altdest@entry=0x5639555a5310, completionTag=completionTag@entry=0x7ffef442dee0 "") at /build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/tcop/pquery.c:815
#15 0x00005639548ce643 in exec_execute_message (max_rows=9223372036854775807, portal_name=0x5639555a4f00 "") at /build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/tcop/postgres.c:1977
#16 PostgresMain (argc=<optimized out>, argv=argv@entry=0x563955527678, dbname=0x5639555275a0 "dfkjlo7osvu4dh", username=<optimized out>) at /build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/tcop/postgres.c:4133
#17 0x00005639546543fb in BackendRun (port=0x56395551dcb0) at /build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/postmaster/postmaster.c:4270
#18 BackendStartup (port=0x56395551dcb0) at /build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/postmaster/postmaster.c:3944
#19 ServerLoop () at /build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/postmaster/postmaster.c:1701
#20 0x000056395486e06b in PostmasterMain (argc=5, argv=<optimized out>) at /build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/postmaster/postmaster.c:1309
#21 0x0000563954655142 in main (argc=5, argv=0x5639554d53a0) at /build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/main/main.c:228

Backtrace of query with PID 16987:

#0 0x00007fa23544ed87 in write_call_graph (fd=<optimized out>) at gmon.c:287
#1 write_gmon () at gmon.c:394
#2 0x00005639548bf72b in LWLockAcquire (lock=0xffff000f, mode=LW_EXCLUSIVE) at /build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/storage/lmgr/lwlock.c:1287
#3 0x00005639546704e0 in ginHeapTupleFastInsert (ginstate=ginstate@entry=0x7ffef442af00, collector=collector@entry=0x7ffef442aee0) at /build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/access/gin/ginfast.c:255
#4 0x0000563954663985 in gininsert (index=<optimized out>, values=0x7ffef442d590, isnull=0x7ffef442d690 "", ht_ctid=0x5639555df7e4, heapRel=<optimized out>, checkUnique=<optimized out>) at /build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/access/gin/gininsert.c:512
#5 0x00005639547b202c in ExecInsertIndexTuples (slot=slot@entry=0x5639555dc060, tupleid=tupleid@entry=0x5639555df7e4, estate=estate@entry=0x5639555db7c0, noDupErr=noDupErr@entry=0 '\000', specConflict=specConflict@entry=0x0, arbiterIndexes=arbiterIndexes@entry=0x0) at /build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/executor/execIndexing.c:388
#6 0x00005639547d0ffd in ExecInsert (canSetTag=1 '\001', estate=0x5639555db7c0, onconflict=<optimized out>, arbiterIndexes=0x0, planSlot=0x5639555dc060, slot=0x5639555dc060, mtstate=0x5639555dba60) at /build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/executor/nodeModifyTable.c:496
#7 ExecModifyTable (node=node@entry=0x5639555dba60) at /build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/executor/nodeModifyTable.c:1511
#8 0x00005639547b6c38 in ExecProcNode (node=node@entry=0x5639555dba60) at /build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/executor/execProcnode.c:396
#9 0x00005639547b2cfe in ExecutePlan (dest=0x563954e0a300 <donothingDR>, direction=<optimized out>, numberTuples=0, sendTuples=0 '\000', operation=CMD_INSERT, use_parallel_mode=<optimized out>, planstate=0x5639555dba60, estate=0x5639555db7c0) at /build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/executor/execMain.c:1567
#10 standard_ExecutorRun (queryDesc=0x5639555cf320, direction=<optimized out>, count=0) at /build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/executor/execMain.c:338
#11 0x00007fa22a3b1035 in pgss_ExecutorRun (queryDesc=0x5639555cf320, direction=ForwardScanDirection, count=0) at /build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../contrib/pg_stat_statements/pg_stat_statements.c:877
#12 0x00005639548d0549 in ProcessQuery (plan=<optimized out>, sourceText=0x5639555c6920 "INSERT INTO public.que_lockers( pid, worker_count, worker_priorities, ruby_pid, ruby_hostname, listening, queues ) VALUES ( pg_backend_pid(), $1::integer, $2::integer[], $3::integer, $4::text, $5::bo"..., params=0x5639555c6a30, dest=0x563954e0a300 <donothingDR>, completionTag=0x7ffef442dee0 "") at /build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/tcop/pquery.c:187
#13 0x00005639548d079e in PortalRunMulti (portal=portal@entry=0x5639555c0810, isTopLevel=isTopLevel@entry=1 '\001', setHoldSnapshot=setHoldSnapshot@entry=0 '\000', dest=0x563954e0a300 <donothingDR>, dest@entry=0x5639555a4f90, altdest=0x563954e0a300 <donothingDR>, altdest@entry=0x5639555a4f90, completionTag=completionTag@entry=0x7ffef442dee0 "") at /build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/tcop/pquery.c:1303
#14 0x00005639548d1397 in PortalRun (portal=portal@entry=0x5639555c0810, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1 '\001', dest=dest@entry=0x5639555a4f90, altdest=altdest@entry=0x5639555a4f90, completionTag=completionTag@entry=0x7ffef442dee0 "") at /build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/tcop/pquery.c:815
#15 0x00005639548ce643 in exec_execute_message (max_rows=9223372036854775807, portal_name=0x5639555a4b80 "") at /build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/tcop/postgres.c:1977
#16 PostgresMain (argc=<optimized out>, argv=argv@entry=0x563955527678, dbname=0x5639555275a0 "dfkjlo7osvu4dh", username=<optimized out>) at /build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/tcop/postgres.c:4133
#17 0x00005639546543fb in BackendRun (port=0x56395551dcb0) at /build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/postmaster/postmaster.c:4270
#18 BackendStartup (port=0x56395551dcb0) at /build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/postmaster/postmaster.c:3944
#19 ServerLoop () at /build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/postmaster/postmaster.c:1701
#20 0x000056395486e06b in PostmasterMain (argc=5, argv=<optimized out>) at /build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/postmaster/postmaster.c:1309
#21 0x0000563954655142 in main (argc=5, argv=0x5639554d53a0) at /build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/main/main.c:228
```

Heroku support isn't sure how to attach to the 9.6.5 DB without restarting the container, which will mean losing the stuck processes. We can certainly try to do that and wait for the issue to recur, though, if we think the 9.6.1 -> 9.6.5 difference is relevant.

Thanks!
Chris

On Sun, Oct 22, 2017 at 3:23 PM, Chris Hanks <christopher.m.hanks@gmail.com> wrote:
> Alright, thanks, I'll ask them to look into that first thing tomorrow
> morning. They may not get back to me for a day or so, so if anyone has
> anything else they'd like me to ask for in the same request, please
> let me know!
>
> Chris
>
> On Sun, Oct 22, 2017 at 2:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Chris Hanks <christopher.m.hanks@gmail.com> writes:
>>> I'm encountering a locking issue in Postgres. I unfortunately haven't
>>> been able to make a reproduction for it, but it's occurred a few times
>>> on different machines, and I have access to two machines that are
>>> currently locked up, so I'm hoping that someone can help anyway.
>>
>> If you could get stack traces from the stuck backends, that would be
>> pretty helpful.
>>
>>			regards, tom lane
Re: [BUGS] Connections hang indefinitely while taking a LWLockTranche buffer_content lock.
From
Tom Lane
Date:
Chris Hanks <christopher.m.hanks@gmail.com> writes:
> Backtrace of query with PID 22:
> #0 0x00007fa23544ed87 in write_call_graph (fd=<optimized out>) at gmon.c:287
> #1 write_gmon () at gmon.c:394
> #2 0x00005639548bf72b in LWLockAcquire (lock=0xffff0001,
> mode=LW_EXCLUSIVE) at
> /build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/storage/lmgr/lwlock.c:1287
> #3 0x00005639546707c1 in ginHeapTupleFastInsert
> (ginstate=ginstate@entry=0x7ffef442af00,
> collector=collector@entry=0x7ffef442aee0)
> at /build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/access/gin/ginfast.c:314

> Backtrace of query with PID 16987:
> #0 0x00007fa23544ed87 in write_call_graph (fd=<optimized out>) at gmon.c:287
> #1 write_gmon () at gmon.c:394
> #2 0x00005639548bf72b in LWLockAcquire (lock=0xffff000f,
> mode=LW_EXCLUSIVE) at
> /build/postgresql-9.6-HEXeEe/postgresql-9.6-9.6.1/build/../src/backend/storage/lmgr/lwlock.c:1287
> #3 0x00005639546704e0 in ginHeapTupleFastInsert
> (ginstate=ginstate@entry=0x7ffef442af00,
> collector=collector@entry=0x7ffef442aee0)

So it looks like ginHeapTupleFastInsert is doing something that carries a risk of deadlocking on two or more LWLocks, if multiple processes try to insert index entries concurrently. There's no deadlock detection in the LWLock infrastructure --- that's one of the things that makes them "lightweight" locks --- so callers have to avoid such coding patterns.

This is probably enough info for us to identify the exact cause. But I don't have time to look right this minute, and am not the hacker most familiar with the GIN infrastructure anyway. Anyone want to look for the bug?

			regards, tom lane
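For contrast, ordinary heavyweight lock waits do go through the deadlock detector, so the same ordering mistake made at the SQL level resolves itself after deadlock_timeout. A rough sketch of that contrast using two psql sessions and advisory locks (the keys 1 and 2 are arbitrary, chosen only for the example):

```
-- Session A:
SELECT pg_advisory_lock(1);
-- Session B:
SELECT pg_advisory_lock(2);
-- Session A (blocks, waiting for B):
SELECT pg_advisory_lock(2);
-- Session B (blocks, waiting for A; after deadlock_timeout, 1s by default,
-- the deadlock detector cancels one session with "ERROR: deadlock detected"):
SELECT pg_advisory_lock(1);
```

An LWLock wait has no equivalent check, which is why an inconsistent buffer-lock ordering inside the server leaves both backends stuck indefinitely, as seen here.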
Re: [BUGS] Connections hang indefinitely while taking a LWLockTranche buffer_content lock.
From
Peter Geoghegan
Date:
On Mon, Oct 23, 2017 at 9:49 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> This is probably enough info for us to identify the exact cause. But
> I don't have time to look right this minute, and am not the hacker most
> familiar with the GIN infrastructure anyway. Anyone want to look for
> the bug?

I don't know very much about GIN, but this does look interesting. I'll try to get to it later in the week; I have meetings over the next couple of days.

--
Peter Geoghegan
Re: [BUGS] Connections hang indefinitely while taking a LWLockTranche buffer_content lock.
From
Chris Hanks
Date:
I'm not sure if you guys had noticed this already, but just in case, those two hung connections weren't making the inserts at exactly the same time - if you look at pg_stat_activity they executed about a day apart (since Heroku cycles the clients every 24 hours or so). And before I restarted the first DB that experienced the problem, there were ~30 backends built up over the course of a month. It seems like when one INSERT sticks, every following INSERT just stacks up on top of it, trying to take out the same lock.

This isn't to say that a race condition didn't cause the issue in the first place (it may have, given how difficult it is to reproduce). But it might have been a read using the index or something like that. From my knowledge of the usage patterns for the app, I believe that only a single INSERT should have been taking place at that time - the INSERT only happens when a worker process starts up, there was only one worker process for the app, and the worker processes were only restarted every 24 hours. Also I'd expect this table to be empty when the INSERT took place (it's an unlogged table that tracks which connections belong to clients that are ready to pull from a queue).

Also, the only time the index would be used, to my knowledge, is in a trigger on another table when a row is inserted. It tries to pseudorandomly pick a client that's ready for work:

```
SELECT pid INTO locker_pid
FROM (
  SELECT *, last_value(row_number) OVER () + 1 AS count
  FROM (
    SELECT *, row_number() OVER () - 1 AS row_number
    FROM (
      SELECT *
      FROM public.que_lockers ql, generate_series(1, ql.worker_count) AS id
      WHERE listening AND queues @> ARRAY[NEW.queue]
      ORDER BY md5(pid::text || id::text)
    ) t1
  ) t2
) t3
WHERE NEW.id % count = row_number;
```

Chris

On Mon, Oct 23, 2017 at 12:55 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Mon, Oct 23, 2017 at 9:49 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> This is probably enough info for us to identify the exact cause. But
>> I don't have time to look right this minute, and am not the hacker most
>> familiar with the GIN infrastructure anyway. Anyone want to look for
>> the bug?
>
> I don't know very much about GIN, but this does look interesting. I'll
> try to get to it later in the week; I have meetings over the next
> couple of days.
>
> --
> Peter Geoghegan
Re: [BUGS] Connections hang indefinitely while taking a LWLockTranche buffer_content lock.
From
Peter Geoghegan
Date:
I managed to get a couple of hours to look at this this afternoon.

On Mon, Oct 23, 2017 at 10:39 AM, Chris Hanks <christopher.m.hanks@gmail.com> wrote:
> I'm not sure if you guys had noticed this already, but just in case,
> those two hung connections weren't making the inserts at exactly the
> same time - if you look at pg_stat_activity they executed about a day
> apart (since Heroku cycles the clients every 24 hours or so). And
> before I restarted the first DB that experienced the problem, there
> were ~30 backends built up over the course of a month. It seems like
> when one INSERT sticks, every following INSERT just stacks up on top
> of it, trying to take out the same lock.

Right. In both backtraces, we see that we're an inserter stuck on getting an exclusive buffer lock on the buffer containing block 0, the meta page block (GIN_METAPAGE_BLKNO). There is probably some session/backend that has acquired two buffer locks in an order that is inconsistent with these inserters, meaning that you get an undetectable deadlock. (The only alternative to that theory is that some backend sits on a meta page buffer lock for some other reason, but that seems much less likely.)

The interesting question to my mind is: What backend is the other backend that acquires buffer locks in an incompatible order, resulting in this undetectable deadlock? What's it doing? (I worked for Heroku until quite recently; I may be able to open a back channel, with Chris' permission.)

I remember expressing befuddlement about commit e95680832854c over a year ago, and never satisfying myself that it was correct [1]. I'm no expert on GIN, so I dropped it. It feels like that might be relevant here, since you seem to be using this GIN index with a queue table. That usage pattern is one where entire leaf pages in indexes tend to be routinely deleted and later recycled by VACUUM, at least with B-Trees [2]. Whereas, in general I think B-Tree (and presumably GIN) page deletion is fairly rare, since the entire page must be empty for it to happen.

The follow-up bugfix commit, e2c79e14, added a ConditionalLockPage() to the insert ginInsertCleanup() path, while also adding a LockPage() to the VACUUM path. In case you missed it, those are *heavyweight* page lock acquisitions, not buffer lock acquisitions, which is pretty unconventional (I thought only hash still did that). Frankly, the e2c79e14 fix seems kind of bolted on (though I don't want to focus on that aspect right now).

[1] https://postgr.es/m/CAM3SWZSDxqDBvUGOoNm0veVOwgJV3GDvoncYr6f5L16qo8MYRg@mail.gmail.com
[2] https://brandur.org/postgres-queues
--
Peter Geoghegan
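Since both stuck inserters are in ginHeapTupleFastInsert(), i.e. GIN's fast-update pending-list path that ginInsertCleanup() later flushes, the size of that pending list can also be inspected from SQL. A minimal sketch, assuming the pgstattuple extension is available on the instance (not something anyone in the thread confirmed for Heroku):

```
-- pgstatginindex() reports the fast-update pending list that
-- ginInsertCleanup() is responsible for flushing into the main index.
CREATE EXTENSION IF NOT EXISTS pgstattuple;

SELECT version, pending_pages, pending_tuples
FROM pgstatginindex('public.que_lockers_listening_queues_idx');
```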
Re: [BUGS] Connections hang indefinitely while taking a LWLockTranche buffer_content lock.
From
Chris Hanks
Date:
I wound up working around the issue by forking the database and removing that GIN index, and things have been fine with the new instance for the past two days.

I previously had two Postgres instances with hung processes, one 9.6.1 and one 9.6.5. For work reasons I destroyed the 9.6.5 (Heroku support was having issues with its container anyway) but I've kept the 9.6.1. I'm happy to give permission for you to access it, just let me know who to talk to, or have them reach out to me. I have an open issue in their tracker under my work email, chris@pathgather.com.

Thanks again!

On Thu, Oct 26, 2017 at 7:14 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> I managed to get a couple of hours to look at this this afternoon.
>
> On Mon, Oct 23, 2017 at 10:39 AM, Chris Hanks
> <christopher.m.hanks@gmail.com> wrote:
>> I'm not sure if you guys had noticed this already, but just in case,
>> those two hung connections weren't making the inserts at exactly the
>> same time - if you look at pg_stat_activity they executed about a day
>> apart (since Heroku cycles the clients every 24 hours or so). And
>> before I restarted the first DB that experienced the problem, there
>> were ~30 backends built up over the course of a month. It seems like
>> when one INSERT sticks, every following INSERT just stacks up on top
>> of it, trying to take out the same lock.
>
> Right. In both backtraces, we see that we're an inserter stuck on
> getting an exclusive buffer lock on the buffer containing block 0, the
> meta page block (GIN_METAPAGE_BLKNO). There is probably some
> session/backend that has acquired two buffer locks in an order that is
> inconsistent with these inserters, meaning that you get an
> undetectable deadlock. (The only alternative to that theory is that
> some backend sits on a meta page buffer lock for some other reason,
> but that seems much less likely.)
>
> The interesting question to my mind is: What backend is the other
> backend that acquires buffer locks in an incompatible order, resulting
> in this undetectable deadlock? What's it doing? (I worked for Heroku
> until quite recently; I may be able to open a back channel, with
> Chris' permission.)
>
> I remember expressing befuddlement about commit e95680832854c over a
> year ago, and never satisfying myself that it was correct [1]. I'm no
> expert on GIN, so I dropped it. It feels like that might be relevant
> here, since you seem to be using this GIN index with a queue table.
> That usage pattern is one where entire leaf pages in indexes tend to
> be routinely deleted and later recycled by VACUUM, at least with
> B-Trees [2]. Whereas, in general I think B-Tree (and presumably GIN)
> page deletion is fairly rare, since the entire page must be empty for
> it to happen.
>
> The follow-up bugfix commit, e2c79e14, added a ConditionalLockPage()
> to the insert ginInsertCleanup() path, while also adding a LockPage()
> to the VACUUM path. In case you missed it, those are *heavyweight*
> page lock acquisitions, not buffer lock acquisitions, which is pretty
> unconventional (I thought only hash still did that). Frankly, the
> e2c79e14 fix seems kind of bolted on (though I don't want to focus on
> that aspect right now).
>
> [1] https://postgr.es/m/CAM3SWZSDxqDBvUGOoNm0veVOwgJV3GDvoncYr6f5L16qo8MYRg@mail.gmail.com
> [2] https://brandur.org/postgres-queues
> --
> Peter Geoghegan
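The workaround described above amounts to dropping the GIN index on the healthy, forked instance. A sketch of what that looks like; the exact commands are not in the thread, and the fastupdate alternative is mentioned here only as an aside, not something anyone in the thread recommended:

```
-- Drop the GIN index so inserts no longer go through the fast-update
-- pending-list path; CONCURRENTLY avoids blocking writers on a healthy
-- instance (it cannot run inside a transaction block).
DROP INDEX CONCURRENTLY public.que_lockers_listening_queues_idx;

-- Alternatively, GIN's pending list can be disabled for a single index,
-- which keeps the index but bypasses ginHeapTupleFastInsert():
-- ALTER INDEX public.que_lockers_listening_queues_idx SET (fastupdate = off);
```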
Re: [BUGS] Connections hang indefinitely while taking a LWLockTranche buffer_content lock.
From
Chris Hanks
Date:
Hey, I realize it's a bit late, but did anybody ever find a root cause for this? I removed the relevant index and haven't had an issue since.

Thanks,
Chris

On Thu, Oct 26, 2017 at 7:42 PM, Chris Hanks <christopher.m.hanks@gmail.com> wrote:
> I wound up working around the issue by forking the database and
> removing that GIN index, and things have been fine with the new
> instance for the past two days.
>
> I previously had two Postgres instances with hung processes, one 9.6.1
> and one 9.6.5. For work reasons I destroyed the 9.6.5 (Heroku support
> was having issues with its container anyway) but I've kept the 9.6.1.
> I'm happy to give permission for you to access it, just let me know
> who to talk to, or have them reach out to me. I have an open issue in
> their tracker under my work email, chris@pathgather.com.
>
> Thanks again!
>
> On Thu, Oct 26, 2017 at 7:14 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> I managed to get a couple of hours to look at this this afternoon.
>>
>> On Mon, Oct 23, 2017 at 10:39 AM, Chris Hanks
>> <christopher.m.hanks@gmail.com> wrote:
>>> I'm not sure if you guys had noticed this already, but just in case,
>>> those two hung connections weren't making the inserts at exactly the
>>> same time - if you look at pg_stat_activity they executed about a day
>>> apart (since Heroku cycles the clients every 24 hours or so). And
>>> before I restarted the first DB that experienced the problem, there
>>> were ~30 backends built up over the course of a month. It seems like
>>> when one INSERT sticks, every following INSERT just stacks up on top
>>> of it, trying to take out the same lock.
>>
>> Right. In both backtraces, we see that we're an inserter stuck on
>> getting an exclusive buffer lock on the buffer containing block 0, the
>> meta page block (GIN_METAPAGE_BLKNO). There is probably some
>> session/backend that has acquired two buffer locks in an order that is
>> inconsistent with these inserters, meaning that you get an
>> undetectable deadlock. (The only alternative to that theory is that
>> some backend sits on a meta page buffer lock for some other reason,
>> but that seems much less likely.)
>>
>> The interesting question to my mind is: What backend is the other
>> backend that acquires buffer locks in an incompatible order, resulting
>> in this undetectable deadlock? What's it doing? (I worked for Heroku
>> until quite recently; I may be able to open a back channel, with
>> Chris' permission.)
>>
>> I remember expressing befuddlement about commit e95680832854c over a
>> year ago, and never satisfying myself that it was correct [1]. I'm no
>> expert on GIN, so I dropped it. It feels like that might be relevant
>> here, since you seem to be using this GIN index with a queue table.
>> That usage pattern is one where entire leaf pages in indexes tend to
>> be routinely deleted and later recycled by VACUUM, at least with
>> B-Trees [2]. Whereas, in general I think B-Tree (and presumably GIN)
>> page deletion is fairly rare, since the entire page must be empty for
>> it to happen.
>>
>> The follow-up bugfix commit, e2c79e14, added a ConditionalLockPage()
>> to the insert ginInsertCleanup() path, while also adding a LockPage()
>> to the VACUUM path. In case you missed it, those are *heavyweight*
>> page lock acquisitions, not buffer lock acquisitions, which is pretty
>> unconventional (I thought only hash still did that). Frankly, the
>> e2c79e14 fix seems kind of bolted on (though I don't want to focus on
>> that aspect right now).
>>
>> [1] https://postgr.es/m/CAM3SWZSDxqDBvUGOoNm0veVOwgJV3GDvoncYr6f5L16qo8MYRg@mail.gmail.com
>> [2] https://brandur.org/postgres-queues
>> --
>> Peter Geoghegan