Re: How to debug a connection that's "active" but hanging? - Mailing list pgsql-general

From Jurrie Overgoor
Subject Re: How to debug a connection that's "active" but hanging?
Date
Msg-id eb266d47-dd5c-3622-aa5a-fd45ed71d9de@jurr.org
In response to Re: How to debug a connection that's "active" but hanging?  (Vijaykumar Jain <vijaykumarjain.github@gmail.com>)
Responses Re: How to debug a connection that's "active" but hanging?  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-general
On 10-07-2021 10:26, Vijaykumar Jain wrote:
> On Sat, 10 Jul 2021 at 00:29, Jurrie Overgoor <postgresql-mailinglist@jurr.org> wrote:
>> Hi everyone,
>>
>> We are in the process of upgrading from PostgreSQL 9.6 to 13. When our
>> database gets created in our regression tests, we run some unit tests
>> first. We see one of those tests hang.
>>
>> It seems the client is waiting on more data to arrive from the
>> PostgreSQL server. A thread dump shows it waiting on the socket.
>>
>> On the server, I see the connection from the client. Looking at
>> pg_stat_activity, I see it is in state 'active'. I have seen query_start
>> be as old as eight hours ago. The documentation states that 'active'
>> means a query is being executed. The query in question is:
>>
>> [...]
>>
>> Looking at pg_locks, I only see locks with granted = true for the PID.
>
> Is this reproducible, i.e. does this happen multiple times?


Hi Vijaykumar, thanks for replying. Yes, this is reproducible: about 50% of the time the connection hangs. It's always on the same query, which I shared in my previous post. These are unit tests that are executed just after the database is created from scratch using Liquibase.


> Can you please run the query below in a separate session and share the result; feel free to anonymize sensitive stuff.
>
> SELECT db.datname, locktype, relation::regclass, mode, transactionid
> AS tid, virtualtransaction AS vtid, pid, granted
> FROM pg_catalog.pg_locks l
> LEFT JOIN pg_catalog.pg_database db ON db.oid = l.database
> WHERE NOT pid = pg_backend_pid();


Sure! This is all test data; there is not much that needs to be anonymized :) Here it is:


|datname|locktype  |relation                                     |mode           |tid|vtid   |pid  |granted|
|-------|----------|---------------------------------------------|---------------|---|-------|-----|-------|
|wildfly|relation  |pg_constraint_contypid_index                 |AccessShareLock|   |4/46389|22928|true   |
|wildfly|relation  |pg_constraint_conrelid_contypid_conname_index|AccessShareLock|   |4/46389|22928|true   |
|wildfly|relation  |pg_constraint_conname_nsp_index              |AccessShareLock|   |4/46389|22928|true   |
|wildfly|relation  |pg_constraint_conparentid_index              |AccessShareLock|   |4/46389|22928|true   |
|wildfly|relation  |pg_namespace_oid_index                       |AccessShareLock|   |4/46389|22928|true   |
|wildfly|relation  |pg_namespace_nspname_index                   |AccessShareLock|   |4/46389|22928|true   |
|wildfly|relation  |pg_attribute_relid_attnum_index              |AccessShareLock|   |4/46389|22928|true   |
|wildfly|relation  |pg_attribute_relid_attnam_index              |AccessShareLock|   |4/46389|22928|true   |
|wildfly|relation  |pg_depend                                    |AccessShareLock|   |4/46389|22928|true   |
|wildfly|relation  |pg_constraint                                |AccessShareLock|   |4/46389|22928|true   |
|wildfly|relation  |pg_class                                     |AccessShareLock|   |4/46389|22928|true   |
|wildfly|relation  |pg_namespace                                 |AccessShareLock|   |4/46389|22928|true   |
|wildfly|relation  |pg_attribute                                 |AccessShareLock|   |4/46389|22928|true   |
|wildfly|relation  |information_schema.table_constraints         |AccessShareLock|   |4/46389|22928|true   |
|wildfly|relation  |information_schema.referential_constraints   |AccessShareLock|   |4/46389|22928|true   |
|wildfly|relation  |information_schema.key_column_usage          |AccessShareLock|   |4/46389|22928|true   |
|       |virtualxid|                                             |ExclusiveLock  |   |4/46389|22928|true   |
|wildfly|relation  |pg_depend_depender_index                     |AccessShareLock|   |4/46389|22928|true   |
|wildfly|relation  |pg_class_oid_index                           |AccessShareLock|   |4/46389|22928|true   |
|wildfly|relation  |pg_constraint_oid_index                      |AccessShareLock|   |4/46389|22928|true   |
|wildfly|relation  |pg_class_relname_nsp_index                   |AccessShareLock|   |4/46389|22928|true   |
|wildfly|relation  |pg_depend_reference_index                    |AccessShareLock|   |4/46389|22928|true   |
|wildfly|relation  |pg_class_tblspc_relfilenode_index            |AccessShareLock|   |4/46389|22928|true   |


Indeed, this time it's PID 22928 that's hanging.
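
For what it's worth, this is roughly how I spot the hanging backend in the first place (a minimal sketch against pg_stat_activity, run from a separate session):

-- Sketch: list long-running 'active' backends other than this session
SELECT pid, state, wait_event_type, wait_event,
       now() - query_start AS running_for, query
FROM pg_catalog.pg_stat_activity
WHERE state = 'active'
  AND pid <> pg_backend_pid()
ORDER BY query_start;

In my runs the hanging backend shows up as 'active' with a query_start hours in the past, as described above.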


>> There is nothing in the logs as far as I can see. Configuration
>> variables log_min_messages and log_min_error_statement are on 'debug1'.
>> This is a snippet of the logs:
>>
>> 2021-07-09 20:35:16.374 CEST [30399] STATEMENT:  START TRANSACTION
>> 2021-07-09 20:35:18.703 CEST [30399] WARNING:  there is already a
>> transaction in progress
>
> Are you issuing multiple BEGINs without matching COMMITs/ROLLBACKs? Not sure it's related, but fixing this will help reduce noise.


Actually, I'm not quite sure where those messages come from. They indicate, of course, that a transaction is being started while one is already in progress. I'd have to dig into the framework we use to find out where they originate. My gut feeling is that this is not part of the problem, just noise.
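
For completeness, the warning itself is trivially reproduced with a nested BEGIN; this is just an illustration, not our actual framework code:

BEGIN;
BEGIN;  -- elicits: WARNING:  there is already a transaction in progress
COMMIT;

So somewhere the framework is apparently issuing START TRANSACTION on a connection that already has an open transaction.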


>> I am a bit out of ideas - does anyone have any tips where I should look
>> to see what is causing the query to hang?
>
> Although I am not an expert at this, I am exploring it myself.
> Can you collect a backtrace for this pid?
>
> You can try enabling core dumps, then take a core dump of the hanging pid using `gcore`
> (if you do not want to attach to a live process), generate a backtrace from it using `gdb`,
> and share the trace. I think that might help.


I'm on CentOS 7 and tried to install the debuginfo packages, but they seem to be unavailable / missing from the repository, so the backtrace is lacking symbol information. Do you know where to get those packages?
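
For reference, this is roughly what I tried (assuming the PGDG package naming on CentOS 7; debuginfo-install comes from yum-utils):

$ sudo yum install yum-utils
$ sudo debuginfo-install postgresql13-server

The debuginfo repository appears to have no matching packages on my mirror.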

As this is a test machine, I have no problem attaching gdb directly to the hanging process. I created a backtrace using the steps outlined under "Attaching gdb to the backend" in the page you linked, interrupting four times.
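
In case it helps anyone reproduce this, the gdb session went roughly like this (a sketch; the binary path is from this machine):

$ gdb /usr/pgsql-13/bin/postgres -p 22928
(gdb) cont
  ... wait a bit, then press Ctrl-C ...
(gdb) bt
  ... repeat cont / Ctrl-C / bt four times in total ...
(gdb) quit

The resulting file is this: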

Continuing.

Program received signal SIGINT, Interrupt.
0x000000000063819d in tts_buffer_heap_getsomeattrs ()
#0  0x000000000063819d in tts_buffer_heap_getsomeattrs ()
#1  0x0000000000639cda in slot_getsomeattrs_int ()
#2  0x000000000062a59f in ExecInterpExpr ()
#3  0x0000000000637403 in ExecScan ()
#4  0x000000000064d2e8 in ExecHashJoin ()
#5  0x000000000064d2e8 in ExecHashJoin ()
#6  0x000000000063758a in ExecScan ()
#7  0x0000000000644209 in ExecAppend ()
#8  0x0000000000658fe5 in ExecNestLoop ()
#9  0x0000000000654afa in ExecMergeJoin ()
#10 0x0000000000658fe5 in ExecNestLoop ()
#11 0x0000000000654afa in ExecMergeJoin ()
#12 0x0000000000658fe5 in ExecNestLoop ()
#13 0x000000000064c845 in MultiExecHash ()
#14 0x000000000064d21f in ExecHashJoin ()
#15 0x0000000000658ed9 in ExecNestLoop ()
#16 0x000000000065b3a6 in ExecSort ()
#17 0x000000000065ee68 in ExecUnique ()
#18 0x000000000062db82 in standard_ExecutorRun ()
#19 0x00000000007922cb in PortalRunSelect ()
#20 0x00000000007935c7 in PortalRun ()
#21 0x0000000000791042 in PostgresMain ()
#22 0x0000000000487dbf in ServerLoop ()
#23 0x0000000000719db8 in PostmasterMain ()
#24 0x0000000000488cfd in main ()

Continuing.

Program received signal SIGINT, Interrupt.
0x000000000063819b in tts_buffer_heap_getsomeattrs ()
#0  0x000000000063819b in tts_buffer_heap_getsomeattrs ()
#1  0x0000000000639cda in slot_getsomeattrs_int ()
#2  0x000000000062a59f in ExecInterpExpr ()
#3  0x0000000000637403 in ExecScan ()
#4  0x0000000000658fe5 in ExecNestLoop ()
#5  0x0000000000658ed9 in ExecNestLoop ()
#6  0x000000000063758a in ExecScan ()
#7  0x0000000000644209 in ExecAppend ()
#8  0x0000000000658fe5 in ExecNestLoop ()
#9  0x0000000000654afa in ExecMergeJoin ()
#10 0x0000000000658fe5 in ExecNestLoop ()
#11 0x0000000000654afa in ExecMergeJoin ()
#12 0x0000000000658fe5 in ExecNestLoop ()
#13 0x000000000064c845 in MultiExecHash ()
#14 0x000000000064d21f in ExecHashJoin ()
#15 0x0000000000658ed9 in ExecNestLoop ()
#16 0x000000000065b3a6 in ExecSort ()
#17 0x000000000065ee68 in ExecUnique ()
#18 0x000000000062db82 in standard_ExecutorRun ()
#19 0x00000000007922cb in PortalRunSelect ()
#20 0x00000000007935c7 in PortalRun ()
#21 0x0000000000791042 in PostgresMain ()
#22 0x0000000000487dbf in ServerLoop ()
#23 0x0000000000719db8 in PostmasterMain ()
#24 0x0000000000488cfd in main ()

Continuing.

Program received signal SIGINT, Interrupt.
0x00000000004eaee7 in _bt_fix_scankey_strategy ()
#0  0x00000000004eaee7 in _bt_fix_scankey_strategy ()
#1  0x00000000004ec2d5 in _bt_preprocess_keys ()
#2  0x00000000004e76b2 in _bt_first ()
#3  0x00000000004e40eb in btgettuple ()
#4  0x00000000004dca01 in index_getnext_tid ()
#5  0x00000000004dcb5b in index_getnext_slot ()
#6  0x0000000000650996 in IndexNext ()
#7  0x000000000063758a in ExecScan ()
#8  0x0000000000658fe5 in ExecNestLoop ()
#9  0x0000000000658ed9 in ExecNestLoop ()
#10 0x000000000063758a in ExecScan ()
#11 0x0000000000644209 in ExecAppend ()
#12 0x0000000000658fe5 in ExecNestLoop ()
#13 0x0000000000654afa in ExecMergeJoin ()
#14 0x0000000000658fe5 in ExecNestLoop ()
#15 0x0000000000654afa in ExecMergeJoin ()
#16 0x0000000000658fe5 in ExecNestLoop ()
#17 0x000000000064c845 in MultiExecHash ()
#18 0x000000000064d21f in ExecHashJoin ()
#19 0x0000000000658ed9 in ExecNestLoop ()
#20 0x000000000065b3a6 in ExecSort ()
#21 0x000000000065ee68 in ExecUnique ()
#22 0x000000000062db82 in standard_ExecutorRun ()
#23 0x00000000007922cb in PortalRunSelect ()
#24 0x00000000007935c7 in PortalRun ()
#25 0x0000000000791042 in PostgresMain ()
#26 0x0000000000487dbf in ServerLoop ()
#27 0x0000000000719db8 in PostmasterMain ()
#28 0x0000000000488cfd in main ()

Continuing.

Program received signal SIGINT, Interrupt.
0x00007f5af41f33ca in __memcpy_ssse3_back () from /lib64/libc.so.6
#0  0x00007f5af41f33ca in __memcpy_ssse3_back () from /lib64/libc.so.6
#1  0x000000000087fa4a in text_catenate ()
#2  0x00000000006295d0 in ExecInterpExpr ()
#3  0x000000000064d5f9 in ExecHashJoin ()
#4  0x000000000063758a in ExecScan ()
#5  0x0000000000644209 in ExecAppend ()
#6  0x0000000000658fe5 in ExecNestLoop ()
#7  0x0000000000654afa in ExecMergeJoin ()
#8  0x0000000000658fe5 in ExecNestLoop ()
#9  0x0000000000654afa in ExecMergeJoin ()
#10 0x0000000000658fe5 in ExecNestLoop ()
#11 0x000000000064c845 in MultiExecHash ()
#12 0x000000000064d21f in ExecHashJoin ()
#13 0x0000000000658ed9 in ExecNestLoop ()
#14 0x000000000065b3a6 in ExecSort ()
#15 0x000000000065ee68 in ExecUnique ()
#16 0x000000000062db82 in standard_ExecutorRun ()
#17 0x00000000007922cb in PortalRunSelect ()
#18 0x00000000007935c7 in PortalRun ()
#19 0x0000000000791042 in PostgresMain ()
#20 0x0000000000487dbf in ServerLoop ()
#21 0x0000000000719db8 in PostmasterMain ()
#22 0x0000000000488cfd in main ()
A debugging session is active.

        Inferior 1 [process 22928] will be detached.

Quit anyway? (y or n) Detaching from program: /usr/pgsql-13/bin/postgres, process 22928
[Inferior 1 (process 22928) detached]


I hope this gives some insight. If anyone can point me to the debuginfo CentOS packages, I'd be happy to generate a backtrace again.

With kind regards,

Jurrie




