Thread: COPY FROM STDIN hang
PostgreSQL 9.6.17 on amd64-portbld-freebsd12.1, compiled by FreeBSD clang version 9.0.1 (git@github.com:llvm/llvm-project.git c1a0a213378a458fbea1a5c77b315c7dce08fd05) (based on LLVM 9.0.1), 64-bit
JDBC driver 9.3-1104-jdbc41, Java 1.8.0_242

We’re getting occasional hangs when using COPY FROM STDIN. Note that we can reliably reproduce this problem on our appliance after a few hours of runtime with a specific workload. Namely, doing 30-50 COPY operations per second. Each query is copying a few key/value metadata pairs, nothing significant.

java.lang.Thread.State: RUNNABLE
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    at java.net.SocketInputStream.read(SocketInputStream.java:171)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at org.postgresql.core.VisibleBufferedInputStream.readMore(VisibleBufferedInputStream.java:143)
    at org.postgresql.core.VisibleBufferedInputStream.ensureBytes(VisibleBufferedInputStream.java:112)
    at org.postgresql.core.VisibleBufferedInputStream.read(VisibleBufferedInputStream.java:71)
    at org.postgresql.core.PGStream.ReceiveChar(PGStream.java:282)
    at org.postgresql.core.v3.QueryExecutorImpl.processCopyResults(QueryExecutorImpl.java:930)
    at org.postgresql.core.v3.QueryExecutorImpl.endCopy(QueryExecutorImpl.java:828)
    - locked <43521773> (a org.postgresql.core.v3.QueryExecutorImpl)
    at org.postgresql.core.v3.CopyInImpl.endCopy(CopyInImpl.java:59)
    at org.postgresql.copy.CopyManager.copyIn(CopyManager.java:203)
    at org.postgresql.copy.CopyManager.copyIn(CopyManager.java:183)
    …

This is the only query running on the server at the time when we get the hang. I’m unable to cancel or terminate the query; I have to kill -9 the pid from the command line. When the COPY operations come in at a more limited pace, we don’t see this issue.
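For reference, each operation is roughly equivalent to the following (a sketch only — the sample rows are made up; the real rows are streamed through the JDBC driver’s CopyManager.copyIn(), as in the stack trace above):

```sql
-- One COPY operation, as issued via CopyManager.copyIn() (sample data is invented):
COPY ds3.s3_object_property (id, key, object_id, value) FROM STDIN WITH DELIMITER AS '|';
id-0001|owner|obj-0001|jsmith
id-0002|size|obj-0001|1048576
\.
```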
SELECT pid, client_port, now() - query_start AS "runtime", query_start, datname, state, wait_event_type, query, usename
FROM pg_stat_activity
WHERE query !~ 'pg_stat_activity' AND state != 'idle'
ORDER BY state, runtime DESC;

pid | client_port | runtime | query_start | datname | state | wait_event_type | query | usename |
---|---|---|---|---|---|---|---|---|
12976 | 14322 | 0 years 0 mons 0 days 13 hours 35 mins 5.073483 secs | 2021-03-23 07:02:22.892034 | tapesystem | active | | COPY ds3.s3_object_property (id, key, object_id, value) FROM STDIN WITH DELIMITER AS '|' | Administrator |

Thoughts?

Thanks,
Robert
Robert Creager <robert@logicalchaos.org> writes:
> We’re getting occasional hangs when using COPY FROM STDIN.

You mean the backend hangs?  Can you get a stack trace from that?

https://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend

			regards, tom lane
> On Mar 23, 2021, at 3:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Robert Creager <robert@logicalchaos.org> writes:
>> We’re getting occasional hangs when using COPY FROM STDIN.
>
> You mean the backend hangs?  Can you get a stack trace from that?

The entire backend doesn’t hang; we’re able to process queries on other connections. It’s just the COPY FROM thread that’s hung.

> https://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend

(gdb) cont
Continuing.
^C
Program received signal SIGINT, Interrupt.
0x00000000006a0d5b in UnpinBuffer ()
(gdb) bt
#0  0x00000000006a0d5b in UnpinBuffer ()
#1  0x00000000006a0abe in ReleaseAndReadBuffer ()
#2  0x00000000004a9c00 in _bt_relandgetbuf ()
#3  0x00000000004acbc6 in _bt_moveright ()
#4  0x00000000004aca0a in _bt_search ()
#5  0x00000000004a6a6e in _bt_doinsert ()
#6  0x00000000004abac6 in btinsert ()
#7  0x00000000005b52b0 in ExecInsertIndexTuples ()
#8  0x0000000000561de0 in CopyFromInsertBatch ()
#9  0x000000000055c3cd in DoCopy ()
#10 0x00000000006d46f1 in standard_ProcessUtility ()
#11 0x00000000006d3c69 in PortalRunUtility ()
#12 0x00000000006d3346 in PortalRunMulti ()
#13 0x00000000006d2d59 in PortalRun ()
#14 0x00000000006d11c8 in PostgresMain ()
#15 0x000000000066b224 in BackendRun ()
#16 0x000000000066a8e6 in ServerLoop ()
#17 0x00000000006679c0 in PostmasterMain ()
#18 0x00000000005f4da7 in main ()
(gdb) cont
Continuing.
^C
Program received signal SIGINT, Interrupt.
0x00000000006c2c88 in LWLockRelease ()
(gdb) bt
#0  0x00000000006c2c88 in LWLockRelease ()
#1  0x000000000069fd3c in ReadBuffer_common ()
#2  0x000000000069fb6c in ReadBufferExtended ()
#3  0x00000000004a9c00 in _bt_relandgetbuf ()
#4  0x00000000004acbc6 in _bt_moveright ()
#5  0x00000000004aca0a in _bt_search ()
#6  0x00000000004a6a6e in _bt_doinsert ()
#7  0x00000000004abac6 in btinsert ()
#8  0x00000000005b52b0 in ExecInsertIndexTuples ()
#9  0x0000000000561de0 in CopyFromInsertBatch ()
#10 0x000000000055c3cd in DoCopy ()
#11 0x00000000006d46f1 in standard_ProcessUtility ()
#12 0x00000000006d3c69 in PortalRunUtility ()
#13 0x00000000006d3346 in PortalRunMulti ()
#14 0x00000000006d2d59 in PortalRun ()
#15 0x00000000006d11c8 in PostgresMain ()
#16 0x000000000066b224 in BackendRun ()
#17 0x000000000066a8e6 in ServerLoop ()
#18 0x00000000006679c0 in PostmasterMain ()
#19 0x00000000005f4da7 in main ()
(gdb) cont
Continuing.
^C
Program received signal SIGINT, Interrupt.
0x00000000004accba in _bt_moveright ()
(gdb) bt
#0  0x00000000004accba in _bt_moveright ()
#1  0x00000000004aca0a in _bt_search ()
#2  0x00000000004a6a6e in _bt_doinsert ()
#3  0x00000000004abac6 in btinsert ()
#4  0x00000000005b52b0 in ExecInsertIndexTuples ()
#5  0x0000000000561de0 in CopyFromInsertBatch ()
#6  0x000000000055c3cd in DoCopy ()
#7  0x00000000006d46f1 in standard_ProcessUtility ()
#8  0x00000000006d3c69 in PortalRunUtility ()
#9  0x00000000006d3346 in PortalRunMulti ()
#10 0x00000000006d2d59 in PortalRun ()
#11 0x00000000006d11c8 in PostgresMain ()
#12 0x000000000066b224 in BackendRun ()
#13 0x000000000066a8e6 in ServerLoop ()
#14 0x00000000006679c0 in PostmasterMain ()
#15 0x00000000005f4da7 in main ()
(gdb) detach
Robert Creager <robert@logicalchaos.org> writes:
> On Mar 23, 2021, at 3:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> You mean the backend hangs?  Can you get a stack trace from that?

> The entire backend doesn’t hang, we’re able to process queries on other connections. It’s just the COPY FROM thread that’s hung.

Hmm, that set of traces doesn't really look like it's hung, it might just be taking a long time.

A different theory, which is hard to prove without debug symbols to provide more detail in the trace, is that it might be looping due to following a closed loop of btree page right-links. Which shouldn't happen normally, but perhaps this index is corrupt. Can you try reindexing the indexes of the COPY's target table?

			regards, tom lane
On Mar 23, 2021, at 3:38 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Robert Creager <robert@logicalchaos.org> writes:
>> On Mar 23, 2021, at 3:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> You mean the backend hangs?  Can you get a stack trace from that?
>> The entire backend doesn’t hang, we’re able to process queries on other connections. It’s just the COPY FROM thread that’s hung.
>
> Hmm, that set of traces doesn't really look like it's hung,
> it might just be taking a long time.
>
> A different theory, which is hard to prove without debug symbols
> to provide more detail in the trace, is that it might be looping
> due to following a closed loop of btree page right-links. Which
> shouldn't happen normally, but perhaps this index is corrupt.
> Can you try reindexing the indexes of the COPY's target table?
The REINDEX hangs on locks; I’d have to kill the COPY and reproduce it to get it running, wouldn’t I?
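If it comes to that, the sequence would presumably be something like this (a sketch only — the pid is the COPY backend from pg_stat_activity, and as noted earlier, terminating that backend normally hasn’t worked for us):

```sql
-- Sketch: free the lock by terminating the stuck COPY backend,
-- then rebuild the indexes on the COPY's target table.
SELECT pg_terminate_backend(12976);
REINDEX TABLE ds3.s3_object_property;
```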
SELECT pid,
client_port,
now() - query_start AS "runtime",
query_start,
datname,
state,
wait_event_type,
query,
usename
FROM pg_stat_activity
WHERE query !~ 'pg_stat_activity' AND
state != 'idle'
ORDER BY state, runtime DESC;
pid | client_port | runtime | query_start | datname | state | wait_event_type | query | usename |
---|---|---|---|---|---|---|---|---|
12976 | 14322 | 0 years 0 mons 0 days 14 hours 47 mins 1.687744 secs | 2021-03-23 07:02:22.892034 | tapesystem | active | NULL | COPY ds3.s3_object_property (id, key, object_id, value) FROM STDIN WITH DELIMITER AS '|' | Administrator |
48638 | 32868 | 0 years 0 mons 0 days 0 hours 6 mins 23.497398 secs | 2021-03-23 21:43:01.082380 | tapesystem | active | Lock | REINDEX TABLE ds3.s3_object_property | Administrator |
WITH RECURSIVE
c(requested, CURRENT) AS (VALUES ('AccessShareLock' :: TEXT, 'AccessExclusiveLock' :: TEXT),
('RowShareLock' :: TEXT, 'ExclusiveLock' :: TEXT),
('RowShareLock' :: TEXT, 'AccessExclusiveLock' :: TEXT),
('RowExclusiveLock' :: TEXT, 'ShareLock' :: TEXT),
('RowExclusiveLock' :: TEXT, 'ShareRowExclusiveLock' :: TEXT),
('RowExclusiveLock' :: TEXT, 'ExclusiveLock' :: TEXT),
('RowExclusiveLock' :: TEXT, 'AccessExclusiveLock' :: TEXT),
('ShareUpdateExclusiveLock' :: TEXT, 'ShareUpdateExclusiveLock' :: TEXT),
('ShareUpdateExclusiveLock' :: TEXT, 'ShareLock' :: TEXT),
('ShareUpdateExclusiveLock' :: TEXT, 'ShareRowExclusiveLock' :: TEXT),
('ShareUpdateExclusiveLock' :: TEXT, 'ExclusiveLock' :: TEXT),
('ShareUpdateExclusiveLock' :: TEXT, 'AccessExclusiveLock' :: TEXT),
('ShareLock' :: TEXT, 'RowExclusiveLock' :: TEXT),
('ShareLock' :: TEXT, 'ShareUpdateExclusiveLock' :: TEXT),
('ShareLock' :: TEXT, 'ShareRowExclusiveLock' :: TEXT),
('ShareLock' :: TEXT, 'ExclusiveLock' :: TEXT),
('ShareLock' :: TEXT, 'AccessExclusiveLock' :: TEXT),
('ShareRowExclusiveLock' :: TEXT, 'RowExclusiveLock' :: TEXT),
('ShareRowExclusiveLock' :: TEXT, 'ShareUpdateExclusiveLock' :: TEXT),
('ShareRowExclusiveLock' :: TEXT, 'ShareLock' :: TEXT),
('ShareRowExclusiveLock' :: TEXT, 'ShareRowExclusiveLock' :: TEXT),
('ShareRowExclusiveLock' :: TEXT, 'ExclusiveLock' :: TEXT),
('ShareRowExclusiveLock' :: TEXT, 'AccessExclusiveLock' :: TEXT),
('ExclusiveLock' :: TEXT, 'RowShareLock' :: TEXT),
('ExclusiveLock' :: TEXT, 'RowExclusiveLock' :: TEXT),
('ExclusiveLock' :: TEXT, 'ShareUpdateExclusiveLock' :: TEXT),
('ExclusiveLock' :: TEXT, 'ShareLock' :: TEXT),
('ExclusiveLock' :: TEXT, 'ShareRowExclusiveLock' :: TEXT),
('ExclusiveLock' :: TEXT, 'ExclusiveLock' :: TEXT),
('ExclusiveLock' :: TEXT, 'AccessExclusiveLock' :: TEXT),
('AccessExclusiveLock' :: TEXT, 'AccessShareLock' :: TEXT),
('AccessExclusiveLock' :: TEXT, 'RowShareLock' :: TEXT),
('AccessExclusiveLock' :: TEXT, 'RowExclusiveLock' :: TEXT),
('AccessExclusiveLock' :: TEXT, 'ShareUpdateExclusiveLock' :: TEXT),
('AccessExclusiveLock' :: TEXT, 'ShareLock' :: TEXT),
('AccessExclusiveLock' :: TEXT, 'ShareRowExclusiveLock' :: TEXT),
('AccessExclusiveLock' :: TEXT, 'ExclusiveLock' :: TEXT),
('AccessExclusiveLock' :: TEXT, 'AccessExclusiveLock' :: TEXT)),
l AS
(
SELECT (locktype, DATABASE, relation :: REGCLASS :: TEXT, page, tuple, virtualxid, transactionid, classid, objid, objsubid) AS target,
virtualtransaction,
pid,
mode,
GRANTED
FROM pg_catalog.pg_locks
),
t AS
(
SELECT blocker.target AS blocker_target,
blocker.pid AS blocker_pid,
blocker.mode AS blocker_mode,
blocked.target AS target,
blocked.pid AS pid,
blocked.mode AS mode
FROM l blocker
JOIN l blocked
ON (NOT blocked.GRANTED
AND blocker.GRANTED
AND blocked.pid != blocker.pid
AND blocked.target IS NOT DISTINCT FROM blocker.target)
JOIN c ON (c.requested = blocked.mode AND c.CURRENT = blocker.mode)
),
r AS
(
SELECT blocker_target,
blocker_pid,
blocker_mode,
'1' :: INT AS depth,
target,
pid,
mode,
blocker_pid :: TEXT || ',' || pid :: TEXT AS seq
FROM t
UNION ALL
SELECT blocker.blocker_target,
blocker.blocker_pid,
blocker.blocker_mode,
blocker.depth + 1,
blocked.target,
blocked.pid,
blocked.mode,
blocker.seq || ',' || blocked.pid :: TEXT
FROM r blocker
JOIN t blocked
ON (blocked.blocker_pid = blocker.pid)
WHERE blocker.depth < 1000
)
SELECT *
FROM r;
blocker_target | blocker_pid | blocker_mode | depth | target | pid | mode | seq |
---|---|---|---|---|---|---|---|
(relation,16954,ds3.s3_object_property,,,,,,,) | 12976 | RowExclusiveLock | 1 | (relation,16954,ds3.s3_object_property,,,,,,,) | 48638 | ShareLock | 12976,48638 |
On Mar 23, 2021, at 3:51 PM, Robert Creager <robert@logicalchaos.org> wrote:

> A different theory, which is hard to prove without debug symbols
> to provide more detail in the trace, is that it might be looping
> due to following a closed loop of btree page right-links. Which
> shouldn't happen normally, but perhaps this index is corrupt.
> Can you try reindexing the indexes of the COPY's target table?
Thanks Tom. Since the reindex, we’ve been unable to reproduce the problem.
The big question is: do you have any random thoughts on how an index may have gotten corrupted? The db lives on a two-SSD ZFS mirror, so I’d not expect disk corruption.
Best,
Robert