Thread: BUG #15617: walsender hang if sync replica disconnected from network
BUG #15617: walsender hang if sync replica disconnected from network
From
PG Bug reporting form
Date:
The following bug has been logged on the website: Bug reference: 15617 Logged by: Evgeny Efimkin Email address: efimkin@yandex-team.ru PostgreSQL version: 10.6 Operating system: Ubuntu 14.04.5 LTS (GNU/Linux 4.4.52-25 x86_64) Description: I have a very rare problem when wal_sender_timeout didn't work, if sync replica have network problem (disconnected). I can't reproduce it test env. Some diagnostic xdb41f/postgres M # select version(); version ----------------------------------------------------------------------------------------------------------------------------------- PostgreSQL 10.6 (Ubuntu 10.6-1.pgdg14.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 4.8.4-2ubuntu1~14.04.4) 4.8.4, 64-bit (1 row) Time: 0.214 ms xdb41f/postgres M # show wal_sender_timeout ; wal_sender_timeout -------------------- 1min (1 row) pg_stat_activity: xdb41f/postgres M # select * from pg_stat_activity where pid=556566; -[ RECORD 1 ]----+---------------------------------- datid | [null] datname | [null] pid | 556566 usesysid | 16403 usename | repl application_name | xdb41e_mail_yandex_net client_addr | 2a02:6b8:0:801:ec4:7aff:fe52:cd5e client_hostname | xdb41e.mail.yandex.net client_port | 39132 backend_start | 2018-12-13 03:25:16.891091+03 xact_start | [null] query_start | [null] state_change | 2018-12-13 03:25:16.906387+03 wait_event_type | Client wait_event | ClientWrite state | active backend_xid | [null] backend_xmin | [null] query | backend_type | walsender backtrace on walsender process: Wed Jan 30 18:20:10 2019 [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". 0x00007fd1b38bf6b3 in __epoll_wait_nocancel () at ../sysdeps/unix/syscall-template.S:81 #0 0x00007fd1b38bf6b3 in __epoll_wait_nocancel () at ../sysdeps/unix/syscall-template.S:81 #1 0x000055b85a26f4d9 in WaitEventSetWaitBlock (nevents=1, occurred_events=0x7ffce744cfd0, cur_timeout=-1, set=0x55b85c5e3d98) at /build/postgresql-10-hXjK9Z/postgresql-10-10.6/build/../src/backend/storage/ipc/latch.c:1048 #2 WaitEventSetWait (set=0x55b85c5e3d98, timeout=timeout@entry=-1, occurred_events=occurred_events@entry=0x7ffce744cfd0, nevents=nevents@entry=1, wait_event_info=wait_event_info@entry=100663297) at /build/postgresql-10-hXjK9Z/postgresql-10-10.6/build/../src/backend/storage/ipc/latch.c:1000 #3 0x000055b85a18da5f in secure_write (port=0x55b85c6380a0, ptr=ptr@entry=0x7fd1b5cb0048, len=len@entry=131102) at /build/postgresql-10-hXjK9Z/postgresql-10-10.6/build/../src/backend/libpq/be-secure.c:273 #4 0x000055b85a198cc6 in internal_flush () at /build/postgresql-10-hXjK9Z/postgresql-10-10.6/build/../src/backend/libpq/pqcomm.c:1433 #5 0x000055b85a198ee1 in internal_putbytes (s=s@entry=0x7ffce744d0ac "S", len=len@entry=1) at /build/postgresql-10-hXjK9Z/postgresql-10-10.6/build/../src/backend/libpq/pqcomm.c:1379 #6 0x000055b85a199002 in socket_putmessage (msgtype=83 'S', s=0x55b85c65ebd8 "client_encoding", len=26) at /build/postgresql-10-hXjK9Z/postgresql-10-10.6/build/../src/backend/libpq/pqcomm.c:1566 #7 0x000055b85a19b024 in pq_endmessage (buf=buf@entry=0x7ffce744d100) at /build/postgresql-10-hXjK9Z/postgresql-10-10.6/build/../src/backend/libpq/pqformat.c:347 #8 0x000055b85a3b22a9 in ReportGUCOption (record=0x55b85a80e7f8 <ConfigureNamesString+152>) at /build/postgresql-10-hXjK9Z/postgresql-10-10.6/build/../src/backend/utils/misc/guc.c:5408 #9 0x000055b85a3b39d8 in set_config_option (name=name@entry=0x55b85a5389ef "client_encoding", value=0x55b85a52a4a2 "SQL_ASCII", context=context@entry=PGC_BACKEND, source=source@entry=PGC_S_DYNAMIC_DEFAULT, action=action@entry=GUC_ACTION_SET, changeVal=changeVal@entry=1 '\001', elevel=<optimized out>, elevel@entry=0, is_reload=is_reload@entry=0 '\000') at /build/postgresql-10-hXjK9Z/postgresql-10-10.6/build/../src/backend/utils/misc/guc.c:6638 #10 0x000055b85a3b4275 in SetConfigOption (name=name@entry=0x55b85a5389ef "client_encoding", value=<optimized out>, context=context@entry=PGC_BACKEND, source=source@entry=PGC_S_DYNAMIC_DEFAULT) at /build/postgresql-10-hXjK9Z/postgresql-10-10.6/build/../src/backend/utils/misc/guc.c:6685 #11 0x000055b85a3b94f0 in ProcessConfigFileInternal (context=context@entry=PGC_SIGHUP, applySettings=applySettings@entry=1 '\001', elevel=elevel@entry=13) at guc-file.l:411 #12 0x000055b85a3b9974 in ProcessConfigFile (context= ) at guc-file.l:155 #13 0x000055b85a247cd0 in WalSndLoop (send_data=send_data@entry=0x55b85a247790 <XLogSendPhysical>) at /build/postgresql-10-hXjK9Z/postgresql-10-10.6/build/../src/backend/replication/walsender.c:2132 #14 0x000055b85a2482bc in StartReplication (cmd=0x55b85c600de8) at /build/postgresql-10-hXjK9Z/postgresql-10-10.6/build/../src/backend/replication/walsender.c:684 #15 exec_replication_command (cmd_string=cmd_string@entry=0x55b85c707658 "START_REPLICATION SLOT \"xdb41e_mail_yandex_net\" 57AC/BC000000 TIMELINE 1") at /build/postgresql-10-hXjK9Z/postgresql-10-10.6/build/../src/backend/replication/walsender.c:1535 #16 0x000055b85a2910e6 in PostgresMain (argc=<optimized out>, argv=argv@entry=0x55b85c63a8e0, dbname=0x55b85c63a820 "", username=<optimized out>) at /build/postgresql-10-hXjK9Z/postgresql-10-10.6/build/../src/backend/tcop/postgres.c:4113 #17 0x000055b859fd44b5 in BackendRun (port=0x55b85c6380a0) at /build/postgresql-10-hXjK9Z/postgresql-10-10.6/build/../src/backend/postmaster/postmaster.c:4405 #18 BackendStartup (port=0x55b85c6380a0) at /build/postgresql-10-hXjK9Z/postgresql-10-10.6/build/../src/backend/postmaster/postmaster.c:4077 #19 ServerLoop () at /build/postgresql-10-hXjK9Z/postgresql-10-10.6/build/../src/backend/postmaster/postmaster.c:1755 #20 0x000055b85a22316e in PostmasterMain (argc=11, argv=<optimized out>) at /build/postgresql-10-hXjK9Z/postgresql-10-10.6/build/../src/backend/postmaster/postmaster.c:1363 #21 0x000055b859fd5406 in main (argc=11, argv=0x55b85c5e21f0) at /build/postgresql-10-hXjK9Z/postgresql-10-10.6/build/../src/backend/main/main.c:228 Any idea why it happened? If it happen in next time, which(and how) diagnostic need to collect?
Hi! I reproduced it. 1. Make some write load on master: `pgbench -i -s 100000 testdb` 2. On sync replica drop output connections, while pgbench init database: `ip6tables -A OUTPUT -p tcp --dport 5432 -j DROP` 3. Until was_sender_timeout reached, i send SIGHUB on master: `select pg_reload_conf();` After that walsender forever stay in WaitEventSetWait, because no timeout src/backend/libpq/be-secure.c:273 WaitEventSetWait(FeBeWaitSet, -1 /* no timeout */ , &event, 1, WAIT_EVENT_CLIENT_READ);
Hi! Maybe need additional diagnostic? 02.02.2019, 20:41, "Evgeniy Efimkin" <efimkin@yandex-team.ru>: > Hi! > I reproduced it. > 1. Make some write load on master: `pgbench -i -s 100000 testdb` > 2. On sync replica drop output connections, while pgbench init database: `ip6tables -A OUTPUT -p tcp --dport 5432 -j DROP` > 3. Until was_sender_timeout reached, i send SIGHUB on master: `select pg_reload_conf();` > After that walsender forever stay in WaitEventSetWait, because no timeout > > src/backend/libpq/be-secure.c:273 > WaitEventSetWait(FeBeWaitSet, -1 /* no timeout */ , &event, 1, > WAIT_EVENT_CLIENT_READ); -------- Efimkin Evgeny