Thread: COPY FROM STDIN hang

COPY FROM STDIN hang

From
Robert Creager
Date:
PostgreSQL 9.6.17 on amd64-portbld-freebsd12.1, compiled by FreeBSD clang version 9.0.1
(git@github.com:llvm/llvm-project.git c1a0a213378a458fbea1a5c77b315c7dce08fd05) (based on LLVM 9.0.1), 64-bit 
PostgreSQL JDBC driver 9.3-1104-jdbc41
Java 1.8.0_242

We’re getting occasional hangs when using COPY FROM STDIN.  Note that we can reliably reproduce this problem on our
appliance after a few hours of runtime with a specific workload, namely doing 30-50 COPY operations per second.  Each
query is copying a few key/value metadata pairs, nothing significant. 

   java.lang.Thread.State: RUNNABLE
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:171)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at org.postgresql.core.VisibleBufferedInputStream.readMore(VisibleBufferedInputStream.java:143)
        at org.postgresql.core.VisibleBufferedInputStream.ensureBytes(VisibleBufferedInputStream.java:112)
        at org.postgresql.core.VisibleBufferedInputStream.read(VisibleBufferedInputStream.java:71)
        at org.postgresql.core.PGStream.ReceiveChar(PGStream.java:282)
        at org.postgresql.core.v3.QueryExecutorImpl.processCopyResults(QueryExecutorImpl.java:930)
        at org.postgresql.core.v3.QueryExecutorImpl.endCopy(QueryExecutorImpl.java:828)
        - locked <43521773> (a org.postgresql.core.v3.QueryExecutorImpl)
        at org.postgresql.core.v3.CopyInImpl.endCopy(CopyInImpl.java:59)
        at org.postgresql.copy.CopyManager.copyIn(CopyManager.java:203)
        at org.postgresql.copy.CopyManager.copyIn(CopyManager.java:183)
…

This is the only query running on the server at the time when we get the hang.  I’m unable to cancel or terminate the
query; I have to kill -9 the pid from the command line.  When the COPY operations come in at a more limited pace, we
don’t see this issue. 

SELECT pid,
    client_port,
    now() - query_start AS "runtime",
    query_start,
    datname,
    state,
    wait_event_type,
    query,
    usename
FROM pg_stat_activity
WHERE query !~ 'pg_stat_activity' AND
    state != 'idle'
ORDER BY state, runtime DESC;

pid: 12976 | client_port: 14322 | runtime: 13 hours 35 mins 5.073483 secs | query_start: 2021-03-23 07:02:22.892034 | datname: tapesystem | state: active | wait_event_type: NULL | query: COPY ds3.s3_object_property (id, key, object_id, value) FROM STDIN WITH DELIMITER AS '|' | usename: Administrator

Thoughts?

Thanks,
Robert


Re: COPY FROM STDIN hang

From
Tom Lane
Date:
Robert Creager <robert@logicalchaos.org> writes:
> We’re getting occasional hangs when using COPY FROM STDIN.

You mean the backend hangs?  Can you get a stack trace from that?

https://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend

            regards, tom lane



Re: COPY FROM STDIN hang

From
Robert Creager
Date:

> On Mar 23, 2021, at 3:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Robert Creager <robert@logicalchaos.org> writes:
>> We’re getting occasional hangs when using COPY FROM STDIN.
>
> You mean the backend hangs?  Can you get a stack trace from that?

The entire backend doesn’t hang; we’re able to process queries on other connections.  It’s just the COPY FROM thread
that’s hung. 

>
> https://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend

(gdb) cont
Continuing.
^C
Program received signal SIGINT, Interrupt.
0x00000000006a0d5b in UnpinBuffer ()
(gdb) bt
#0  0x00000000006a0d5b in UnpinBuffer ()
#1  0x00000000006a0abe in ReleaseAndReadBuffer ()
#2  0x00000000004a9c00 in _bt_relandgetbuf ()
#3  0x00000000004acbc6 in _bt_moveright ()
#4  0x00000000004aca0a in _bt_search ()
#5  0x00000000004a6a6e in _bt_doinsert ()
#6  0x00000000004abac6 in btinsert ()
#7  0x00000000005b52b0 in ExecInsertIndexTuples ()
#8  0x0000000000561de0 in CopyFromInsertBatch ()
#9  0x000000000055c3cd in DoCopy ()
#10 0x00000000006d46f1 in standard_ProcessUtility ()
#11 0x00000000006d3c69 in PortalRunUtility ()
#12 0x00000000006d3346 in PortalRunMulti ()
#13 0x00000000006d2d59 in PortalRun ()
#14 0x00000000006d11c8 in PostgresMain ()
#15 0x000000000066b224 in BackendRun ()
#16 0x000000000066a8e6 in ServerLoop ()
#17 0x00000000006679c0 in PostmasterMain ()
#18 0x00000000005f4da7 in main ()
(gdb) cont
Continuing.
^C
Program received signal SIGINT, Interrupt.
0x00000000006c2c88 in LWLockRelease ()
(gdb) bt
#0  0x00000000006c2c88 in LWLockRelease ()
#1  0x000000000069fd3c in ReadBuffer_common ()
#2  0x000000000069fb6c in ReadBufferExtended ()
#3  0x00000000004a9c00 in _bt_relandgetbuf ()
#4  0x00000000004acbc6 in _bt_moveright ()
#5  0x00000000004aca0a in _bt_search ()
#6  0x00000000004a6a6e in _bt_doinsert ()
#7  0x00000000004abac6 in btinsert ()
#8  0x00000000005b52b0 in ExecInsertIndexTuples ()
#9  0x0000000000561de0 in CopyFromInsertBatch ()
#10 0x000000000055c3cd in DoCopy ()
#11 0x00000000006d46f1 in standard_ProcessUtility ()
#12 0x00000000006d3c69 in PortalRunUtility ()
#13 0x00000000006d3346 in PortalRunMulti ()
#14 0x00000000006d2d59 in PortalRun ()
#15 0x00000000006d11c8 in PostgresMain ()
#16 0x000000000066b224 in BackendRun ()
#17 0x000000000066a8e6 in ServerLoop ()
#18 0x00000000006679c0 in PostmasterMain ()
#19 0x00000000005f4da7 in main ()
(gdb) cont
Continuing.
^C
Program received signal SIGINT, Interrupt.
0x00000000004accba in _bt_moveright ()
(gdb) bt
#0  0x00000000004accba in _bt_moveright ()
#1  0x00000000004aca0a in _bt_search ()
#2  0x00000000004a6a6e in _bt_doinsert ()
#3  0x00000000004abac6 in btinsert ()
#4  0x00000000005b52b0 in ExecInsertIndexTuples ()
#5  0x0000000000561de0 in CopyFromInsertBatch ()
#6  0x000000000055c3cd in DoCopy ()
#7  0x00000000006d46f1 in standard_ProcessUtility ()
#8  0x00000000006d3c69 in PortalRunUtility ()
#9  0x00000000006d3346 in PortalRunMulti ()
#10 0x00000000006d2d59 in PortalRun ()
#11 0x00000000006d11c8 in PostgresMain ()
#12 0x000000000066b224 in BackendRun ()
#13 0x000000000066a8e6 in ServerLoop ()
#14 0x00000000006679c0 in PostmasterMain ()
#15 0x00000000005f4da7 in main ()
(gdb) detach




Re: COPY FROM STDIN hang

From
Tom Lane
Date:
Robert Creager <robert@logicalchaos.org> writes:
> On Mar 23, 2021, at 3:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> You mean the backend hangs?  Can you get a stack trace from that?

> The entire backend doesn’t hang, we’re able to process queries on other connections.  It’s just the COPY FROM thread
> that’s hung.

Hmm, that set of traces doesn't really look like it's hung,
it might just be taking a long time.

A different theory, which is hard to prove without debug symbols
to provide more detail in the trace, is that it might be looping
due to following a closed loop of btree page right-links.  Which
shouldn't happen normally, but perhaps this index is corrupt.
Can you try reindexing the indexes of the COPY's target table?

            regards, tom lane
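Tom’s right-link theory can be illustrated with a small sketch (a hypothetical page-chain model, not PostgreSQL source): _bt_moveright keeps following each page’s right-link, so if corruption ever makes a right-link point back to an already-visited page, the walk never terminates. Tracking visited pages makes such a closed loop detectable:

```python
def walk_right_links(right_link, start):
    """Follow a chain of B-tree page right-links from `start`.

    right_link maps page number -> right-sibling page number (0 = end
    of chain).  Returns (pages_visited, cycle_found).  A healthy index
    chain always terminates; a closed loop of right-links never would,
    which is why the real scan code can spin forever on a corrupt index.
    """
    seen = set()
    path = []
    page = start
    while page:
        if page in seen:
            return path, True  # revisited a page: closed loop
        seen.add(page)
        path.append(page)
        page = right_link.get(page, 0)
    return path, False

# A healthy chain ends; a corrupt one (page 3 pointing back to 2) loops.
print(walk_right_links({1: 2, 2: 3, 3: 0}, 1))  # ([1, 2, 3], False)
print(walk_right_links({1: 2, 2: 3, 3: 2}, 1))  # ([1, 2, 3], True)
```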



Re: COPY FROM STDIN hang

From
Robert Creager
Date:


> On Mar 23, 2021, at 3:38 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Robert Creager <robert@logicalchaos.org> writes:
>> On Mar 23, 2021, at 3:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> You mean the backend hangs?  Can you get a stack trace from that?
>
>> The entire backend doesn’t hang, we’re able to process queries on other connections.  It’s just the COPY FROM thread that’s hung.
>
> Hmm, that set of traces doesn't really look like it's hung,
> it might just be taking a long time.

13 hours is a long time.

> A different theory, which is hard to prove without debug symbols
> to provide more detail in the trace, is that it might be looping
> due to following a closed loop of btree page right-links.  Which
> shouldn't happen normally, but perhaps this index is corrupt.
> Can you try reindexing the indexes of the COPY's target table?

The REINDEX hangs on locks; I’d have to kill the COPY and reproduce it to get it running, wouldn’t I?
SELECT pid,
    client_port,
    now() - query_start AS "runtime",
    query_start,
    datname,
    state,
    wait_event_type,
    query,
    usename
FROM pg_stat_activity
WHERE query !~ 'pg_stat_activity' AND
    state != 'idle'
ORDER BY state, runtime DESC;

pid: 12976 | client_port: 14322 | runtime: 14 hours 47 mins 1.687744 secs | query_start: 2021-03-23 07:02:22.892034 | datname: tapesystem | state: active | wait_event_type: NULL | query: COPY ds3.s3_object_property (id, key, object_id, value) FROM STDIN WITH DELIMITER AS '|' | usename: Administrator

pid: 48638 | client_port: 32868 | runtime: 6 mins 23.497398 secs | query_start: 2021-03-23 21:43:01.082380 | datname: tapesystem | state: active | wait_event_type: Lock | query: REINDEX TABLE ds3.s3_object_property | usename: Administrator
WITH RECURSIVE
c(requested, CURRENT) AS (VALUES ('AccessShareLock' :: TEXT, 'AccessExclusiveLock' :: TEXT),
('RowShareLock' :: TEXT, 'ExclusiveLock' :: TEXT),
('RowShareLock' :: TEXT, 'AccessExclusiveLock' :: TEXT),
('RowExclusiveLock' :: TEXT, 'ShareLock' :: TEXT),
('RowExclusiveLock' :: TEXT, 'ShareRowExclusiveLock' :: TEXT),
('RowExclusiveLock' :: TEXT, 'ExclusiveLock' :: TEXT),
('RowExclusiveLock' :: TEXT, 'AccessExclusiveLock' :: TEXT),
('ShareUpdateExclusiveLock' :: TEXT, 'ShareUpdateExclusiveLock' :: TEXT),
('ShareUpdateExclusiveLock' :: TEXT, 'ShareLock' :: TEXT),
('ShareUpdateExclusiveLock' :: TEXT, 'ShareRowExclusiveLock' :: TEXT),
('ShareUpdateExclusiveLock' :: TEXT, 'ExclusiveLock' :: TEXT),
('ShareUpdateExclusiveLock' :: TEXT, 'AccessExclusiveLock' :: TEXT),
('ShareLock' :: TEXT, 'RowExclusiveLock' :: TEXT),
('ShareLock' :: TEXT, 'ShareUpdateExclusiveLock' :: TEXT),
('ShareLock' :: TEXT, 'ShareRowExclusiveLock' :: TEXT),
('ShareLock' :: TEXT, 'ExclusiveLock' :: TEXT),
('ShareLock' :: TEXT, 'AccessExclusiveLock' :: TEXT),
('ShareRowExclusiveLock' :: TEXT, 'RowExclusiveLock' :: TEXT),
('ShareRowExclusiveLock' :: TEXT, 'ShareUpdateExclusiveLock' :: TEXT),
('ShareRowExclusiveLock' :: TEXT, 'ShareLock' :: TEXT),
('ShareRowExclusiveLock' :: TEXT, 'ShareRowExclusiveLock' :: TEXT),
('ShareRowExclusiveLock' :: TEXT, 'ExclusiveLock' :: TEXT),
('ShareRowExclusiveLock' :: TEXT, 'AccessExclusiveLock' :: TEXT),
('ExclusiveLock' :: TEXT, 'RowShareLock' :: TEXT),
('ExclusiveLock' :: TEXT, 'RowExclusiveLock' :: TEXT),
('ExclusiveLock' :: TEXT, 'ShareUpdateExclusiveLock' :: TEXT),
('ExclusiveLock' :: TEXT, 'ShareLock' :: TEXT),
('ExclusiveLock' :: TEXT, 'ShareRowExclusiveLock' :: TEXT),
('ExclusiveLock' :: TEXT, 'ExclusiveLock' :: TEXT),
('ExclusiveLock' :: TEXT, 'AccessExclusiveLock' :: TEXT),
('AccessExclusiveLock' :: TEXT, 'AccessShareLock' :: TEXT),
('AccessExclusiveLock' :: TEXT, 'RowShareLock' :: TEXT),
('AccessExclusiveLock' :: TEXT, 'RowExclusiveLock' :: TEXT),
('AccessExclusiveLock' :: TEXT, 'ShareUpdateExclusiveLock' :: TEXT),
('AccessExclusiveLock' :: TEXT, 'ShareLock' :: TEXT),
('AccessExclusiveLock' :: TEXT, 'ShareRowExclusiveLock' :: TEXT),
('AccessExclusiveLock' :: TEXT, 'ExclusiveLock' :: TEXT),
('AccessExclusiveLock' :: TEXT, 'AccessExclusiveLock' :: TEXT)),
l AS
(
SELECT (locktype, DATABASE, relation :: REGCLASS :: TEXT, page, tuple, virtualxid, transactionid, classid, objid, objsubid) AS target,
virtualtransaction,
pid,
mode,
GRANTED
FROM pg_catalog.pg_locks
),
t AS
(
SELECT blocker.target AS blocker_target,
blocker.pid AS blocker_pid,
blocker.mode AS blocker_mode,
blocked.target AS target,
blocked.pid AS pid,
blocked.mode AS mode
FROM l blocker
JOIN l blocked
ON (NOT blocked.GRANTED
AND blocker.GRANTED
AND blocked.pid != blocker.pid
AND blocked.target IS NOT DISTINCT FROM blocker.target)
JOIN c ON (c.requested = blocked.mode AND c.CURRENT = blocker.mode)
),
r AS
(
SELECT blocker_target,
blocker_pid,
blocker_mode,
'1' :: INT AS depth,
target,
pid,
mode,
blocker_pid :: TEXT || ',' || pid :: TEXT AS seq
FROM t
UNION ALL
SELECT blocker.blocker_target,
blocker.blocker_pid,
blocker.blocker_mode,
blocker.depth + 1,
blocked.target,
blocked.pid,
blocked.mode,
blocker.seq || ',' || blocked.pid :: TEXT
FROM r blocker
JOIN t blocked
ON (blocked.blocker_pid = blocker.pid)
WHERE blocker.depth < 1000
)

blocker_target: (relation,16954,ds3.s3_object_property,,,,,,,) | blocker_pid: 12976 | blocker_mode: RowExclusiveLock | depth: 1 | target: (relation,16954,ds3.s3_object_property,,,,,,,) | pid: 48638 | mode: ShareLock | seq: 12976,48638
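The single result row is consistent with PostgreSQL’s documented table-level lock conflicts: the COPY backend (pid 12976) holds RowExclusiveLock on ds3.s3_object_property, and the REINDEX (pid 48638) is waiting for ShareLock, which conflicts with it. A minimal sketch of that corner of the conflict matrix, transcribed from the (requested, current) pairs enumerated in the query above (the helper name is made up):

```python
# Subset of PostgreSQL's table-level lock conflict matrix, taken from
# the (requested, current) pairs enumerated in the recursive query.
CONFLICTS = {
    "RowExclusiveLock": {"ShareLock", "ShareRowExclusiveLock",
                         "ExclusiveLock", "AccessExclusiveLock"},
    "ShareLock": {"RowExclusiveLock", "ShareUpdateExclusiveLock",
                  "ShareRowExclusiveLock", "ExclusiveLock",
                  "AccessExclusiveLock"},
}

def conflicts(requested, held):
    """True if a lock request must wait behind an already-held lock."""
    return held in CONFLICTS.get(requested, set())

# REINDEX requests ShareLock; COPY holds RowExclusiveLock -> REINDEX waits.
print(conflicts("ShareLock", "RowExclusiveLock"))  # True
# Plain SELECTs (AccessShareLock) would not block the ShareLock request.
print(conflicts("ShareLock", "AccessShareLock"))   # False
```

This is exactly why pid 48638 shows wait_event_type Lock while pid 12976 keeps running.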

Re: COPY FROM STDIN hang

From
Robert Creager
Date:


> On Mar 23, 2021, at 3:51 PM, Robert Creager <robert@logicalchaos.org> wrote:
>
>> A different theory, which is hard to prove without debug symbols
>> to provide more detail in the trace, is that it might be looping
>> due to following a closed loop of btree page right-links.  Which
>> shouldn't happen normally, but perhaps this index is corrupt.
>> Can you try reindexing the indexes of the COPY's target table?


Thanks, Tom. Since the REINDEX, we’ve been unable to reproduce the problem.

The big question is: do you have any thoughts on how an index may have gotten corrupted?  The DB lives on a two-SSD ZFS mirror, so I’d not expect disk corruption.

Best,
Robert