Thread: BUG #15617: walsender hang if sync replica disconnected from network

BUG #15617: walsender hang if sync replica disconnected from network

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      15617
Logged by:          Evgeny Efimkin
Email address:      efimkin@yandex-team.ru
PostgreSQL version: 10.6
Operating system:   Ubuntu 14.04.5 LTS (GNU/Linux 4.4.52-25 x86_64)
Description:

I have a very rare problem when wal_sender_timeout didn't work, if sync
replica have network problem (disconnected). I can't reproduce it test
env.

Some diagnostic

xdb41f/postgres M # select version();
                                                              version

-----------------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 10.6 (Ubuntu 10.6-1.pgdg14.04+1) on x86_64-pc-linux-gnu,
compiled by gcc (Ubuntu 4.8.4-2ubuntu1~14.04.4) 4.8.4, 64-bit
(1 row)

Time: 0.214 ms

xdb41f/postgres M # show wal_sender_timeout ;
 wal_sender_timeout
--------------------
 1min
(1 row)

pg_stat_activity:
xdb41f/postgres M # select * from pg_stat_activity where pid=556566;
-[ RECORD 1 ]----+----------------------------------
datid            | [null]
datname          | [null]
pid              | 556566
usesysid         | 16403
usename          | repl
application_name | xdb41e_mail_yandex_net
client_addr      | 2a02:6b8:0:801:ec4:7aff:fe52:cd5e
client_hostname  | xdb41e.mail.yandex.net
client_port      | 39132
backend_start    | 2018-12-13 03:25:16.891091+03
xact_start       | [null]
query_start      | [null]
state_change     | 2018-12-13 03:25:16.906387+03
wait_event_type  | Client
wait_event       | ClientWrite
state            | active
backend_xid      | [null]
backend_xmin     | [null]
query            |
backend_type     | walsender

backtrace on walsender process:
Wed Jan 30 18:20:10 2019
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007fd1b38bf6b3 in __epoll_wait_nocancel () at
../sysdeps/unix/syscall-template.S:81
#0  0x00007fd1b38bf6b3 in __epoll_wait_nocancel () at
../sysdeps/unix/syscall-template.S:81
#1  0x000055b85a26f4d9 in WaitEventSetWaitBlock (nevents=1,
occurred_events=0x7ffce744cfd0, cur_timeout=-1, set=0x55b85c5e3d98) at
/build/postgresql-10-hXjK9Z/postgresql-10-10.6/build/../src/backend/storage/ipc/latch.c:1048

#2  WaitEventSetWait (set=0x55b85c5e3d98, timeout=timeout@entry=-1,
occurred_events=occurred_events@entry=0x7ffce744cfd0,
nevents=nevents@entry=1, wait_event_info=wait_event_info@entry=100663297) at
/build/postgresql-10-hXjK9Z/postgresql-10-10.6/build/../src/backend/storage/ipc/latch.c:1000

#3  0x000055b85a18da5f in secure_write (port=0x55b85c6380a0,
ptr=ptr@entry=0x7fd1b5cb0048, len=len@entry=131102) at
/build/postgresql-10-hXjK9Z/postgresql-10-10.6/build/../src/backend/libpq/be-secure.c:273
#4  0x000055b85a198cc6 in internal_flush () at
/build/postgresql-10-hXjK9Z/postgresql-10-10.6/build/../src/backend/libpq/pqcomm.c:1433
#5  0x000055b85a198ee1 in internal_putbytes (s=s@entry=0x7ffce744d0ac "S",
len=len@entry=1) at
/build/postgresql-10-hXjK9Z/postgresql-10-10.6/build/../src/backend/libpq/pqcomm.c:1379
#6  0x000055b85a199002 in socket_putmessage (msgtype=83 'S',
s=0x55b85c65ebd8 "client_encoding", len=26) at
/build/postgresql-10-hXjK9Z/postgresql-10-10.6/build/../src/backend/libpq/pqcomm.c:1566
#7  0x000055b85a19b024 in pq_endmessage (buf=buf@entry=0x7ffce744d100) at
/build/postgresql-10-hXjK9Z/postgresql-10-10.6/build/../src/backend/libpq/pqformat.c:347
#8  0x000055b85a3b22a9 in ReportGUCOption (record=0x55b85a80e7f8
<ConfigureNamesString+152>) at
/build/postgresql-10-hXjK9Z/postgresql-10-10.6/build/../src/backend/utils/misc/guc.c:5408
#9  0x000055b85a3b39d8 in set_config_option (name=name@entry=0x55b85a5389ef
"client_encoding", value=0x55b85a52a4a2 "SQL_ASCII",
context=context@entry=PGC_BACKEND,
source=source@entry=PGC_S_DYNAMIC_DEFAULT,
action=action@entry=GUC_ACTION_SET, changeVal=changeVal@entry=1 '\001',
elevel=<optimized out>, elevel@entry=0, is_reload=is_reload@entry=0 '\000')
at
/build/postgresql-10-hXjK9Z/postgresql-10-10.6/build/../src/backend/utils/misc/guc.c:6638
#10 0x000055b85a3b4275 in SetConfigOption (name=name@entry=0x55b85a5389ef
"client_encoding", value=<optimized out>, context=context@entry=PGC_BACKEND,
source=source@entry=PGC_S_DYNAMIC_DEFAULT) at
/build/postgresql-10-hXjK9Z/postgresql-10-10.6/build/../src/backend/utils/misc/guc.c:6685
#11 0x000055b85a3b94f0 in ProcessConfigFileInternal
(context=context@entry=PGC_SIGHUP, applySettings=applySettings@entry=1
'\001', elevel=elevel@entry=13) at guc-file.l:411
#12 0x000055b85a3b9974 in ProcessConfigFile (context=   ) at
guc-file.l:155
#13 0x000055b85a247cd0 in WalSndLoop
(send_data=send_data@entry=0x55b85a247790 <XLogSendPhysical>) at
/build/postgresql-10-hXjK9Z/postgresql-10-10.6/build/../src/backend/replication/walsender.c:2132
#14 0x000055b85a2482bc in StartReplication (cmd=0x55b85c600de8) at
/build/postgresql-10-hXjK9Z/postgresql-10-10.6/build/../src/backend/replication/walsender.c:684
#15 exec_replication_command (cmd_string=cmd_string@entry=0x55b85c707658
"START_REPLICATION SLOT \"xdb41e_mail_yandex_net\" 57AC/BC000000 TIMELINE
1") at
/build/postgresql-10-hXjK9Z/postgresql-10-10.6/build/../src/backend/replication/walsender.c:1535
#16 0x000055b85a2910e6 in PostgresMain (argc=<optimized out>,
argv=argv@entry=0x55b85c63a8e0, dbname=0x55b85c63a820 "",
username=<optimized out>) at
/build/postgresql-10-hXjK9Z/postgresql-10-10.6/build/../src/backend/tcop/postgres.c:4113
#17 0x000055b859fd44b5 in BackendRun (port=0x55b85c6380a0) at
/build/postgresql-10-hXjK9Z/postgresql-10-10.6/build/../src/backend/postmaster/postmaster.c:4405
#18 BackendStartup (port=0x55b85c6380a0) at
/build/postgresql-10-hXjK9Z/postgresql-10-10.6/build/../src/backend/postmaster/postmaster.c:4077
#19 ServerLoop () at
/build/postgresql-10-hXjK9Z/postgresql-10-10.6/build/../src/backend/postmaster/postmaster.c:1755
#20 0x000055b85a22316e in PostmasterMain (argc=11, argv=<optimized out>) at
/build/postgresql-10-hXjK9Z/postgresql-10-10.6/build/../src/backend/postmaster/postmaster.c:1363
#21 0x000055b859fd5406 in main (argc=11, argv=0x55b85c5e21f0) at
/build/postgresql-10-hXjK9Z/postgresql-10-10.6/build/../src/backend/main/main.c:228

Any idea why it happened? If it happen in next time, which(and how)
diagnostic need to collect?


Re: BUG #15617: walsender hang if sync replica disconnected fromnetwork

From
Evgeniy Efimkin
Date:
Hi!
I reproduced it.
1. Make some write load on master: `pgbench -i -s 100000 testdb`
2. On sync replica drop output connections, while pgbench init database: `ip6tables -A OUTPUT -p tcp --dport 5432 -j
DROP`
3. Until was_sender_timeout reached, i send SIGHUB on master: `select pg_reload_conf();`
After that walsender forever stay in WaitEventSetWait, because no timeout

src/backend/libpq/be-secure.c:273
WaitEventSetWait(FeBeWaitSet, -1 /* no timeout */ , &event, 1,
                 WAIT_EVENT_CLIENT_READ);

Re: BUG #15617: walsender hang if sync replica disconnected from network

From
Evgeniy Efimkin
Date:
Hi!
Maybe need additional diagnostic?

02.02.2019, 20:41, "Evgeniy Efimkin" <efimkin@yandex-team.ru>:
> Hi!
> I reproduced it.
> 1. Make some write load on master: `pgbench -i -s 100000 testdb`
> 2. On sync replica drop output connections, while pgbench init database: `ip6tables -A OUTPUT -p tcp --dport 5432 -j
DROP`
> 3. Until was_sender_timeout reached, i send SIGHUB on master: `select pg_reload_conf();`
> After that walsender forever stay in WaitEventSetWait, because no timeout
>
> src/backend/libpq/be-secure.c:273
> WaitEventSetWait(FeBeWaitSet, -1 /* no timeout */ , &event, 1,
>                  WAIT_EVENT_CLIENT_READ);

-------- 
Efimkin Evgeny