An unkillable connection caused replication delay on my replica - Mailing list pgsql-hackers
| From | Shawn |
|---|---|
| Subject | An unkillable connection caused replication delay on my replica |
| Date | |
| Msg-id | 1467143803715-5909767.post@n5.nabble.com |
| Responses | Re: An unkillable connection caused replication delay on my replica (Alvaro Herrera <alvherre@2ndquadrant.com>) |
| List | pgsql-hackers |
Postgres 9.4.5, streaming replica, hot_standby_feedback=1.

I received an alarm that replication was lagging by many hours. When I took a look at what was happening on the replica, I found the recovery process was "stuck" on one WAL segment:

```
8719  8717  1 Apr30 ?  15:24:49 postgres: startup process   recovering 0000000100001E9700000096 waiting
```

An strace of the recovery pid showed a loop trying to kill the blocking pid:

```
select(0, NULL, NULL, NULL, {0, 5000})   = 0 (Timeout)
gettimeofday({1465523436, 442899}, NULL) = 0
kill(6819, SIGUSR1)                      = 0
select(0, NULL, NULL, NULL, {0, 5000})   = 0 (Timeout)
gettimeofday({1465523436, 448134}, NULL) = 0
kill(6819, SIGUSR1)
```

An strace of the long-running query, pid 6819, loops like this:

```
sendto(10, "<query stuff removed>", NULL, 0) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=8719, si_uid=3001} ---
rt_sigreturn()
```

Pid 6819 was a long-running query (it had been going for 2 days). In pg_stat_activity it was still listed as "active". The query had a horrible execution plan and was being executed via a python script.

I couldn't pg_terminate_backend() the connection. I didn't try "kill -9" because of all the warnings about that, and because I felt I had something special here. I attached the debugger:

```
Program received signal SIGINT, Interrupt.
0x00002b89d9c62903 in __select_nocancel () from /lib64/libc.so.6
#0  0x00002b89d9c62903 in __select_nocancel () from /lib64/libc.so.6
#1  0x000000000076e6da in pg_usleep (microsec=microsec@entry=5000) at pgsleep.c:53
#2  0x0000000000655f8e in ResolveRecoveryConflictWithVirtualXIDs (waitlist=0x2b89dc238000, reason=reason@entry=PROCSIG_RECOVERY_CONFLICT_SNAPSHOT) at standby.c:247
#3  0x000000000065609e in ResolveRecoveryConflictWithVirtualXIDs (reason=PROCSIG_RECOVERY_CONFLICT_SNAPSHOT, waitlist=<optimized out>) at standby.c:283
#4  ResolveRecoveryConflictWithSnapshot (latestRemovedXid=<optimized out>, node=...) at standby.c:281
#5  0x000000000048ff7d in heap_xlog_clean (record=0x2b89dc1295c0, lsn=33636416003112) at heapam.c:7329
#6  heap2_redo (lsn=33636416003112, record=0x2b89dc1295c0) at heapam.c:8487
#7  0x00000000004c5fad in StartupXLOG () at xlog.c:6948
#8  0x000000000061f2bf in StartupProcessMain () at startup.c:224
#9  0x00000000004d0188 in AuxiliaryProcessMain (argc=argc@entry=2, argv=argv@entry=0x7fff98c4cba0) at bootstrap.c:422
#10 0x000000000061c8de in StartChildProcess (type=StartupProcess) at postmaster.c:5256
#11 0x000000000061edd2 in PostmasterMain (argc=argc@entry=7, argv=argv@entry=0x2b89dc024040) at postmaster.c:1292
#12 0x00000000004622e5 in main (argc=7, argv=0x2b89dc024040) at main.c:228
Continuing.
```

Which I believe puts me in this section of code:
https://github.com/postgres/postgres/blob/REL9_4_STABLE/src/backend/storage/ipc/standby.c#L232

The customer was able to restart recovery, but only after they "kill -9"ed the python script on their application host.

My questions: Is there something special about this connection that prevents recovery from being able to kill it the way it wanted to? I am trying to repro it, but so far my connections get terminated properly. And can/should Postgres be more forceful once it has determined WaitExceedsMaxStandbyDelay is true and a session shouldn't be blocking recovery?

Shawn
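P.S. For anyone who wants to poke at this, here is the shape of the repro I've been attempting. It's a minimal sketch, not the customer's actual script (which I don't have); psycopg2, the DSN, and the pg_sleep() stand-in for the real query are all my own substitutions:

```python
# Hypothetical repro sketch -- NOT the customer's script, which I don't
# have. Assumes psycopg2 and a hot standby reachable at this made-up DSN.
import psycopg2

# Open a session on the standby and hold a snapshot open for a long time,
# so that replay of a heap-cleanup record (heap_xlog_clean) has to cancel
# this backend, as in the backtrace above.
conn = psycopg2.connect("host=my-standby dbname=mydb user=me")
cur = conn.cursor()

# Stand-in for the real two-day query with the horrible plan. While this
# runs, the primary has to generate cleanup records in the same database
# (e.g., DELETEs followed by a VACUUM) before the startup process hits the
# conflict; with hot_standby_feedback=1 that normally shouldn't happen,
# yet it did in the incident above.
cur.execute("SELECT pg_sleep(86400)")
```

Sessions set up this way get cancelled promptly once the standby delay is exceeded, so this evidently doesn't capture whatever made pid 6819 immune.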