An unkillable connection caused replication delay on my replica - Mailing list pgsql-hackers

From Shawn
Subject An unkillable connection caused replication delay on my replica
Date
Msg-id 1467143803715-5909767.post@n5.nabble.com
Whole thread Raw
Responses Re: An unkillable connection caused replication delay on my replica  (Alvaro Herrera <alvherre@2ndquadrant.com>)
List pgsql-hackers
Postgres 9.4.5
Streaming replica
hot_standby_feedback=1

I received an alarm that replication was lagging many hours.  When I took a
look at what was happening on the replica I found the recovery process was
"stuck" on one WAL.
    8719  8717  1 Apr30 ?        15:24:49 postgres: startup process  
recovering 0000000100001E9700000096 waiting

An strace of the recovery pid showed a loop trying to kill the blocking pid:
    select(0, NULL, NULL, NULL, {0, 5000})  = 0 (Timeout)    gettimeofday({1465523436, 442899}, NULL) = 0    kill(6819,
SIGUSR1)                    = 0    select(0, NULL, NULL, NULL, {0, 5000})  = 0 (Timeout)    gettimeofday({1465523436,
448134},NULL) = 0    kill(6819, SIGUSR1)    
 

strace of long-running query pid 6819 loops like this:
    sendto(10, "<query stuff removed>" NULL, 0) = ? ERESTARTSYS (To be
restarted if SA_RESTART is set)    --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=8719,
si_uid=3001} ---    rt_sigreturn()       

Where pid 6819 is a long (running for 2 days) running query.  In
pg_stat_activity, it was still listed as "active".  The query had a horrible
execution plan and it was being executed via a python script.  I couldn't
pg_terminate_backend the connection.  I didn't try to "kill -9" it due to
all the warnings about that and I felt I had something special here.  I
attached the debugger.

Program received signal SIGINT, Interrupt.
0x00002b89d9c62903 in __select_nocancel () from /lib64/libc.so.6
#0 0x00002b89d9c62903 in __select_nocancel () from /lib64/libc.so.6
#1 0x000000000076e6da in pg_usleep (microsec=microsec@entry=5000) at
pgsleep.c:53
#2 0x0000000000655f8e in ResolveRecoveryConflictWithVirtualXIDs
(waitlist=0x2b89dc238000,
reason=reason@entry=PROCSIG_RECOVERY_CONFLICT_SNAPSHOT) at standby.c:247
#3 0x000000000065609e in ResolveRecoveryConflictWithVirtualXIDs
(reason=PROCSIG_RECOVERY_CONFLICT_SNAPSHOT, waitlist=<optimized out>) at
standby.c:283
#4 ResolveRecoveryConflictWithSnapshot (latestRemovedXid=<optimized out>,
node=...) at standby.c:281
#5 0x000000000048ff7d in heap_xlog_clean (record=0x2b89dc1295c0,
lsn=33636416003112) at heapam.c:7329
#6 heap2_redo (lsn=33636416003112, record=0x2b89dc1295c0) at heapam.c:8487
#7 0x00000000004c5fad in StartupXLOG () at xlog.c:6948
#8 0x000000000061f2bf in StartupProcessMain () at startup.c:224
#9 0x00000000004d0188 in AuxiliaryProcessMain (argc=argc@entry=2,
argv=argv@entry=0x7fff98c4cba0) at bootstrap.c:422
#10 0x000000000061c8de in StartChildProcess (type=StartupProcess) at
postmaster.c:5256
#11 0x000000000061edd2 in PostmasterMain (argc=argc@entry=7,
argv=argv@entry=0x2b89dc024040) at postmaster.c:1292
#12 0x00000000004622e5 in main (argc=7, argv=0x2b89dc024040) at main.c:228
Continuing.

Which I believe puts me in this section of code:
   
https://github.com/postgres/postgres/blob/REL9_4_STABLE/src/backend/storage/ipc/standby.c#L232

The customer was able to restart recovery but only after they "kill -9" the
python script on their application host.  

My question:  Is there something special about this connection that prevents
recovery from being able to kill it like it wanted to?  I am trying to repro
it, but, so far my connections get terminated properly.  Can/should Postgres
be more forceful when its determined WaitExceedsMaxStandbyDelay is true and
a session shouldn't be blocking recovery?

Shawn



--
View this message in context:
http://postgresql.nabble.com/An-unkillable-connection-caused-replication-delay-on-my-replica-tp5909767.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.



pgsql-hackers by date:

Previous
From: Bruce Momjian
Date:
Subject: Re: Questionabl description in datatype.sgml
Next
From: Alvaro Herrera
Date:
Subject: Re: An unkillable connection caused replication delay on my replica