Thread: COPY FROM STDIN hang
PostgreSQL 9.6.17 on amd64-portbld-freebsd12.1, compiled by FreeBSD clang version 9.0.1 (git@github.com:llvm/llvm-project.git c1a0a213378a458fbea1a5c77b315c7dce08fd05) (based on LLVM 9.0.1), 64-bit
JDBC driver 9.3-1104-jdbc41, Java 1.8.0_242

We’re getting occasional hangs when using COPY FROM STDIN. Note that we can reliably reproduce this problem on our appliance after a few hours of runtime with a specific workload. Namely, doing 30-50 COPY operations per second. Each query is copying a few key/value metadata pairs, nothing significant.

java.lang.Thread.State: RUNNABLE
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    at java.net.SocketInputStream.read(SocketInputStream.java:171)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at org.postgresql.core.VisibleBufferedInputStream.readMore(VisibleBufferedInputStream.java:143)
    at org.postgresql.core.VisibleBufferedInputStream.ensureBytes(VisibleBufferedInputStream.java:112)
    at org.postgresql.core.VisibleBufferedInputStream.read(VisibleBufferedInputStream.java:71)
    at org.postgresql.core.PGStream.ReceiveChar(PGStream.java:282)
    at org.postgresql.core.v3.QueryExecutorImpl.processCopyResults(QueryExecutorImpl.java:930)
    at org.postgresql.core.v3.QueryExecutorImpl.endCopy(QueryExecutorImpl.java:828)
    - locked <43521773> (a org.postgresql.core.v3.QueryExecutorImpl)
    at org.postgresql.core.v3.CopyInImpl.endCopy(CopyInImpl.java:59)
    at org.postgresql.copy.CopyManager.copyIn(CopyManager.java:203)
    at org.postgresql.copy.CopyManager.copyIn(CopyManager.java:183)
    …

This is the only query running on the server at the time when we get the hang. I’m unable to cancel or terminate the query; I have to kill -9 the pid from the command line. When the COPY operations come in at a more limited pace, we don’t see this issue.
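For reference, each operation is roughly equivalent to the following (a sketch only — the sample rows are made up; the real rows are streamed through the JDBC driver’s CopyManager.copyIn(), as in the stack trace above):

```sql
-- One COPY operation, as issued via CopyManager.copyIn() (sample data is invented):
COPY ds3.s3_object_property (id, key, object_id, value) FROM STDIN WITH DELIMITER AS '|';
id-0001|owner|obj-0001|jsmith
id-0002|size|obj-0001|1048576
\.
```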
SELECT pid, client_port, now() - query_start AS "runtime", query_start, datname, state, wait_event_type, query, usename
FROM pg_stat_activity
WHERE query !~ 'pg_stat_activity' AND state != 'idle'
ORDER BY state, runtime DESC;

pid | client_port | runtime | query_start | datname | state | wait_event_type | query | usename |
---|---|---|---|---|---|---|---|---|
12976 | 14322 | 0 years 0 mons 0 days 13 hours 35 mins 5.073483 secs | 2021-03-23 07:02:22.892034 | tapesystem | active | | COPY ds3.s3_object_property (id, key, object_id, value) FROM STDIN WITH DELIMITER AS '|' | Administrator |

Thoughts?

Thanks,
Robert
Robert Creager <robert@logicalchaos.org> writes:
> We’re getting occasional hangs when using COPY FROM STDIN.

You mean the backend hangs?  Can you get a stack trace from that?

https://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend

			regards, tom lane
> On Mar 23, 2021, at 3:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Robert Creager <robert@logicalchaos.org> writes:
>> We’re getting occasional hangs when using COPY FROM STDIN.
>
> You mean the backend hangs?  Can you get a stack trace from that?

The entire backend doesn’t hang; we’re able to process queries on other connections. It’s just the COPY FROM thread that’s hung.

> https://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend

(gdb) cont
Continuing.
^C
Program received signal SIGINT, Interrupt.
0x00000000006a0d5b in UnpinBuffer ()
(gdb) bt
#0  0x00000000006a0d5b in UnpinBuffer ()
#1  0x00000000006a0abe in ReleaseAndReadBuffer ()
#2  0x00000000004a9c00 in _bt_relandgetbuf ()
#3  0x00000000004acbc6 in _bt_moveright ()
#4  0x00000000004aca0a in _bt_search ()
#5  0x00000000004a6a6e in _bt_doinsert ()
#6  0x00000000004abac6 in btinsert ()
#7  0x00000000005b52b0 in ExecInsertIndexTuples ()
#8  0x0000000000561de0 in CopyFromInsertBatch ()
#9  0x000000000055c3cd in DoCopy ()
#10 0x00000000006d46f1 in standard_ProcessUtility ()
#11 0x00000000006d3c69 in PortalRunUtility ()
#12 0x00000000006d3346 in PortalRunMulti ()
#13 0x00000000006d2d59 in PortalRun ()
#14 0x00000000006d11c8 in PostgresMain ()
#15 0x000000000066b224 in BackendRun ()
#16 0x000000000066a8e6 in ServerLoop ()
#17 0x00000000006679c0 in PostmasterMain ()
#18 0x00000000005f4da7 in main ()
(gdb) cont
Continuing.
^C
Program received signal SIGINT, Interrupt.
0x00000000006c2c88 in LWLockRelease ()
(gdb) bt
#0  0x00000000006c2c88 in LWLockRelease ()
#1  0x000000000069fd3c in ReadBuffer_common ()
#2  0x000000000069fb6c in ReadBufferExtended ()
#3  0x00000000004a9c00 in _bt_relandgetbuf ()
#4  0x00000000004acbc6 in _bt_moveright ()
#5  0x00000000004aca0a in _bt_search ()
#6  0x00000000004a6a6e in _bt_doinsert ()
#7  0x00000000004abac6 in btinsert ()
#8  0x00000000005b52b0 in ExecInsertIndexTuples ()
#9  0x0000000000561de0 in CopyFromInsertBatch ()
#10 0x000000000055c3cd in DoCopy ()
#11 0x00000000006d46f1 in standard_ProcessUtility ()
#12 0x00000000006d3c69 in PortalRunUtility ()
#13 0x00000000006d3346 in PortalRunMulti ()
#14 0x00000000006d2d59 in PortalRun ()
#15 0x00000000006d11c8 in PostgresMain ()
#16 0x000000000066b224 in BackendRun ()
#17 0x000000000066a8e6 in ServerLoop ()
#18 0x00000000006679c0 in PostmasterMain ()
#19 0x00000000005f4da7 in main ()
(gdb) cont
Continuing.
^C
Program received signal SIGINT, Interrupt.
0x00000000004accba in _bt_moveright ()
(gdb) bt
#0  0x00000000004accba in _bt_moveright ()
#1  0x00000000004aca0a in _bt_search ()
#2  0x00000000004a6a6e in _bt_doinsert ()
#3  0x00000000004abac6 in btinsert ()
#4  0x00000000005b52b0 in ExecInsertIndexTuples ()
#5  0x0000000000561de0 in CopyFromInsertBatch ()
#6  0x000000000055c3cd in DoCopy ()
#7  0x00000000006d46f1 in standard_ProcessUtility ()
#8  0x00000000006d3c69 in PortalRunUtility ()
#9  0x00000000006d3346 in PortalRunMulti ()
#10 0x00000000006d2d59 in PortalRun ()
#11 0x00000000006d11c8 in PostgresMain ()
#12 0x000000000066b224 in BackendRun ()
#13 0x000000000066a8e6 in ServerLoop ()
#14 0x00000000006679c0 in PostmasterMain ()
#15 0x00000000005f4da7 in main ()
(gdb) detach
Robert Creager <robert@logicalchaos.org> writes:
> On Mar 23, 2021, at 3:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> You mean the backend hangs?  Can you get a stack trace from that?

> The entire backend doesn’t hang, we’re able to process queries on other connections. It’s just the COPY FROM thread that’s hung.

Hmm, that set of traces doesn't really look like it's hung, it might just be taking a long time.

A different theory, which is hard to prove without debug symbols to provide more detail in the trace, is that it might be looping due to following a closed loop of btree page right-links. Which shouldn't happen normally, but perhaps this index is corrupt. Can you try reindexing the indexes of the COPY's target table?

			regards, tom lane
On Mar 23, 2021, at 3:38 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Robert Creager <robert@logicalchaos.org> writes:
>> On Mar 23, 2021, at 3:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> You mean the backend hangs?  Can you get a stack trace from that?
>> The entire backend doesn’t hang, we’re able to process queries on other connections. It’s just the COPY FROM thread that’s hung.
>
> Hmm, that set of traces doesn't really look like it's hung,
> it might just be taking a long time.
>
> A different theory, which is hard to prove without debug symbols
> to provide more detail in the trace, is that it might be looping
> due to following a closed loop of btree page right-links. Which
> shouldn't happen normally, but perhaps this index is corrupt.
> Can you try reindexing the indexes of the COPY's target table?
The REINDEX hangs on locks; I’d have to kill the COPY and reproduce it to get it running, wouldn’t I?
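If it comes to that, the sequence would presumably be something like this (a sketch only — the pid is the COPY backend from pg_stat_activity, and as noted earlier, terminating that backend normally hasn’t worked for us):

```sql
-- Sketch: free the lock by terminating the stuck COPY backend,
-- then rebuild the indexes on the COPY's target table.
SELECT pg_terminate_backend(12976);
REINDEX TABLE ds3.s3_object_property;
```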
SELECT pid,
client_port,
now() - query_start AS "runtime",
query_start,
datname,
state,
wait_event_type,
query,
usename
FROM pg_stat_activity
WHERE query !~ 'pg_stat_activity' AND
state != 'idle'
ORDER BY state, runtime DESC;
pid | client_port | runtime | query_start | datname | state | wait_event_type | query | usename |
---|---|---|---|---|---|---|---|---|
12976 | 14322 | 0 years 0 mons 0 days 14 hours 47 mins 1.687744 secs | 2021-03-23 07:02:22.892034 | tapesystem | active | NULL | COPY ds3.s3_object_property (id, key, object_id, value) FROM STDIN WITH DELIMITER AS '|' | Administrator |
48638 | 32868 | 0 years 0 mons 0 days 0 hours 6 mins 23.497398 secs | 2021-03-23 21:43:01.082380 | tapesystem | active | Lock | REINDEX TABLE ds3.s3_object_property | Administrator |
WITH RECURSIVE
c(requested, CURRENT) AS (VALUES ('AccessShareLock' :: TEXT, 'AccessExclusiveLock' :: TEXT),
('RowShareLock' :: TEXT, 'ExclusiveLock' :: TEXT),
('RowShareLock' :: TEXT, 'AccessExclusiveLock' :: TEXT),
('RowExclusiveLock' :: TEXT, 'ShareLock' :: TEXT),
('RowExclusiveLock' :: TEXT, 'ShareRowExclusiveLock' :: TEXT),
('RowExclusiveLock' :: TEXT, 'ExclusiveLock' :: TEXT),
('RowExclusiveLock' :: TEXT, 'AccessExclusiveLock' :: TEXT),
('ShareUpdateExclusiveLock' :: TEXT, 'ShareUpdateExclusiveLock' :: TEXT),
('ShareUpdateExclusiveLock' :: TEXT, 'ShareLock' :: TEXT),
('ShareUpdateExclusiveLock' :: TEXT, 'ShareRowExclusiveLock' :: TEXT),
('ShareUpdateExclusiveLock' :: TEXT, 'ExclusiveLock' :: TEXT),
('ShareUpdateExclusiveLock' :: TEXT, 'AccessExclusiveLock' :: TEXT),
('ShareLock' :: TEXT, 'RowExclusiveLock' :: TEXT),
('ShareLock' :: TEXT, 'ShareUpdateExclusiveLock' :: TEXT),
('ShareLock' :: TEXT, 'ShareRowExclusiveLock' :: TEXT),
('ShareLock' :: TEXT, 'ExclusiveLock' :: TEXT),
('ShareLock' :: TEXT, 'AccessExclusiveLock' :: TEXT),
('ShareRowExclusiveLock' :: TEXT, 'RowExclusiveLock' :: TEXT),
('ShareRowExclusiveLock' :: TEXT, 'ShareUpdateExclusiveLock' :: TEXT),
('ShareRowExclusiveLock' :: TEXT, 'ShareLock' :: TEXT),
('ShareRowExclusiveLock' :: TEXT, 'ShareRowExclusiveLock' :: TEXT),
('ShareRowExclusiveLock' :: TEXT, 'ExclusiveLock' :: TEXT),
('ShareRowExclusiveLock' :: TEXT, 'AccessExclusiveLock' :: TEXT),
('ExclusiveLock' :: TEXT, 'RowShareLock' :: TEXT),
('ExclusiveLock' :: TEXT, 'RowExclusiveLock' :: TEXT),
('ExclusiveLock' :: TEXT, 'ShareUpdateExclusiveLock' :: TEXT),
('ExclusiveLock' :: TEXT, 'ShareLock' :: TEXT),
('ExclusiveLock' :: TEXT, 'ShareRowExclusiveLock' :: TEXT),
('ExclusiveLock' :: TEXT, 'ExclusiveLock' :: TEXT),
('ExclusiveLock' :: TEXT, 'AccessExclusiveLock' :: TEXT),
('AccessExclusiveLock' :: TEXT, 'AccessShareLock' :: TEXT),
('AccessExclusiveLock' :: TEXT, 'RowShareLock' :: TEXT),
('AccessExclusiveLock' :: TEXT, 'RowExclusiveLock' :: TEXT),
('AccessExclusiveLock' :: TEXT, 'ShareUpdateExclusiveLock' :: TEXT),
('AccessExclusiveLock' :: TEXT, 'ShareLock' :: TEXT),
('AccessExclusiveLock' :: TEXT, 'ShareRowExclusiveLock' :: TEXT),
('AccessExclusiveLock' :: TEXT, 'ExclusiveLock' :: TEXT),
('AccessExclusiveLock' :: TEXT, 'AccessExclusiveLock' :: TEXT)),
l AS
(
SELECT (locktype, DATABASE, relation :: REGCLASS :: TEXT, page, tuple, virtualxid, transactionid, classid, objid, objsubid) AS target,
virtualtransaction,
pid,
mode,
GRANTED
FROM pg_catalog.pg_locks
),
t AS
(
SELECT blocker.target AS blocker_target,
blocker.pid AS blocker_pid,
blocker.mode AS blocker_mode,
blocked.target AS target,
blocked.pid AS pid,
blocked.mode AS mode
FROM l blocker
JOIN l blocked
ON (NOT blocked.GRANTED
AND blocker.GRANTED
AND blocked.pid != blocker.pid
AND blocked.target IS NOT DISTINCT FROM blocker.target)
JOIN c ON (c.requested = blocked.mode AND c.CURRENT = blocker.mode)
),
r AS
(
SELECT blocker_target,
blocker_pid,
blocker_mode,
'1' :: INT AS depth,
target,
pid,
mode,
blocker_pid :: TEXT || ',' || pid :: TEXT AS seq
FROM t
UNION ALL
SELECT blocker.blocker_target,
blocker.blocker_pid,
blocker.blocker_mode,
blocker.depth + 1,
blocked.target,
blocked.pid,
blocked.mode,
blocker.seq || ',' || blocked.pid :: TEXT
FROM r blocker
JOIN t blocked
ON (blocked.blocker_pid = blocker.pid)
WHERE blocker.depth < 1000
)
SELECT *
FROM r;
blocker_target | blocker_pid | blocker_mode | depth | target | pid | mode | seq |
---|---|---|---|---|---|---|---|
(relation,16954,ds3.s3_object_property,,,,,,,) | 12976 | RowExclusiveLock | 1 | (relation,16954,ds3.s3_object_property,,,,,,,) | 48638 | ShareLock | 12976,48638 |
On Mar 23, 2021, at 3:51 PM, Robert Creager <robert@logicalchaos.org> wrote:

> A different theory, which is hard to prove without debug symbols
> to provide more detail in the trace, is that it might be looping
> due to following a closed loop of btree page right-links. Which
> shouldn't happen normally, but perhaps this index is corrupt.
> Can you try reindexing the indexes of the COPY's target table?
Thanks Tom. Since the reindex, we’ve been unable to reproduce the problem.
The big question is: do you have any random thoughts on how an index may have gotten corrupted? The db lives on a two-SSD ZFS mirror, so I’d not expect disk corruption.
Best,
Robert