Thread: Segfault when a replica trigger tries to refresh subscription
Greetings!
I have found that executing ALTER SUBSCRIPTION foo REFRESH PUBLICATION from a replica trigger running under the same subscription foo causes a segfault. I have only attempted to reproduce this on pg 11.9.
To reproduce:
Publisher:
CREATE PUBLICATION foo;
CREATE TABLE bar (id serial primary key);
GRANT SELECT ON TABLE bar TO replication_role;
ALTER PUBLICATION foo ADD TABLE bar;
Subscriber:
CREATE TABLE bar (id serial primary key);
CREATE SUBSCRIPTION foo CONNECTION 'host=localhost dbname=foo port=5432 user=replication_role password=foo' PUBLICATION foo;
CREATE FUNCTION refresh_sub() RETURNS TRIGGER AS $BODY$
BEGIN
ALTER SUBSCRIPTION foo REFRESH PUBLICATION;
RETURN NEW;
END;
$BODY$
LANGUAGE plpgsql;
CREATE TRIGGER refresh_sub
BEFORE INSERT ON bar
FOR EACH ROW EXECUTE PROCEDURE refresh_sub();
ALTER TABLE bar ENABLE REPLICA TRIGGER refresh_sub;
Now, on the provider, this will cause a segfault on the subscriber:
INSERT INTO bar (id) VALUES (1);
Logs:
2020-11-11 18:14:09.535 GMT,,,21636,,5fa58af6.5484,8583,,2020-11-06 17:42:14 GMT,,0,LOG,00000,"background worker ""logical replication worker"" (PID 24492) was terminated by signal 11: Segmentation fault",,,,,,,,,""
2020-11-11 18:14:09.535 GMT,,,21636,,5fa58af6.5484,8584,,2020-11-06 17:42:14 GMT,,0,LOG,00000,"terminating any other active server processes",,,,,,,,,""
2020-11-11 18:14:09.535 GMT,,,22766,,5fac286d.58ee,1,,2020-11-11 18:07:41 GMT,1/0,0,WARNING,57P02,"terminating connection because of crash of another server process","The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.","In a moment you should be able to reconnect to the database and repeat your command.",,,,,,,""
2020-11-11 18:14:09.537 GMT,,,21636,,5fa58af6.5484,8585,,2020-11-06 17:42:14 GMT,,0,LOG,00000,"all server processes terminated; reinitializing",,,,,,,,,""
For background: I have been working on adding support for native logical replication to pgl_ddl_deploy, which is why the above command is executed from a DDL queue table. The goal is to automatically add new tables to replication, which, unlike with pglogical, cannot be done from the provider side alone when using native logical replication.
Thanks,
Jeremy
Hi,

On 2020-11-11 12:23:06 -0600, Jeremy Finzel wrote:
> I have only attempted to reproduce this on pg 11.9.

It's reproducible on HEAD.

> Logs:
> 2020-11-11 18:14:09.535 GMT,,,21636,,5fa58af6.5484,8583,,2020-11-06 17:42:14 GMT,,0,LOG,00000,"background worker ""logical replication worker"" (PID 24492) was terminated by signal 11: Segmentation fault",,,,,,,,,""

The backtrace is

#0  __GI___libc_free (mem=0x7f7f7f7f7f7f7f7f) at malloc.c:3102
#1  0x00007ffbef93cea8 in PQfreemem (ptr=0x7f7f7f7f7f7f7f7f) at /home/andres/src/postgresql/src/interfaces/libpq/fe-exec.c:3298
#2  0x00007ffbf462e4f7 in libpqrcv_receive (conn=0x5637f8d91740, buffer=0x7ffd7d7fd770, wait_fd=0x7ffd7d7fd778) at /home/andres/src/postgresql/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c:735
#3  0x00005637f773ec76 in LogicalRepApplyLoop (last_received=23489672) at /home/andres/src/postgresql/src/backend/replication/logical/worker.c:2173
#4  0x00005637f7740733 in ApplyWorkerMain (main_arg=0) at /home/andres/src/postgresql/src/backend/replication/logical/worker.c:3091
#5  0x00005637f76f11dd in StartBackgroundWorker () at /home/andres/src/postgresql/src/backend/postmaster/bgworker.c:820
#6  0x00005637f770672d in do_start_bgworker (rw=0x5637f8caf500) at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:5837
#7  0x00005637f7706b01 in maybe_start_bgworkers () at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:6062
#8  0x00005637f7705a68 in sigusr1_handler (postgres_signal_arg=10) at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:5223
#9  <signal handler called>
#10 0x00007ffbf395d7c6 in __GI___select (nfds=8, readfds=0x7ffd7d7fe770, writefds=0x0, exceptfds=0x0, timeout=0x7ffd7d7fe7f0) at ../sysdeps/unix/sysv/linux/select.c:41
#11 0x00005637f77012ca in ServerLoop () at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1691
#12 0x00005637f7700c67 in PostmasterMain (argc=42, argv=0x5637f8c81e20) at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1400
#13 0x00005637f7601934 in main (argc=42, argv=0x5637f8c81e20) at /home/andres/src/postgresql/src/backend/main/main.c:209

i.e. the connection is freed at this point. The problem seems to be that nobody thought about refresh being called in an apply worker.
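To make the failure mode easier to follow, here is a schematic sketch of how the two code paths collide on the shared global. This is not verbatim PostgreSQL source; the surrounding code is elided and only the identifiers visible in the backtrace and analysis above are real.

	/* The logical replication apply worker's connection to the publisher
	 * is kept in a global (schematic; declarations and context elided). */
	WalReceiverConn *wrconn;

	/* LogicalRepApplyLoop() keeps reading from that global; dispatching the
	 * incoming INSERT is what ends up firing the replica trigger. */
	len = walrcv_receive(wrconn, &buf, &fd);

	/* AlterSubscription_refresh(), reached via SPI from the trigger, reuses
	 * the very same global for its own short-lived connection ... */
	wrconn = walrcv_connect(sub->conninfo, true, sub->name, &err);
	pubrel_names = fetch_table_list(wrconn, sub->publications);

	/* ... and then frees it, so the apply loop's next walrcv_receive() runs
	 * on a dangling connection and crashes inside PQfreemem(). */
	walrcv_disconnect(wrconn);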
Adding a check for refresh overwriting the worker's connection yields:

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007f2242be5537 in __GI_abort () at abort.c:79
#2  0x000055e28c1a7f80 in errfinish (filename=0x7f2243987065 "libpqwalreceiver.c", lineno=706, funcname=0x7f2243987870 <__func__.5> "libpqrcv_disconnect") at /home/andres/src/postgresql/src/backend/utils/error/elog.c:592
#3  0x00007f224398647e in libpqrcv_disconnect (conn=0x55e28d2c4750) at /home/andres/src/postgresql/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c:706
#4  0x000055e28bd7aa4f in AlterSubscription_refresh (sub=0x55e28d2c3f40, copy_data=true) at /home/andres/src/postgresql/src/backend/commands/subscriptioncmds.c:583
#5  0x000055e28bd7b4fe in AlterSubscription (stmt=0x55e28d2e0188) at /home/andres/src/postgresql/src/backend/commands/subscriptioncmds.c:880
#6  0x000055e28c021e5e in ProcessUtilitySlow (pstate=0x55e28d2c3e28, pstmt=0x55e28d2e00f0, queryString=0x55e28d2bc448 "ALTER SUBSCRIPTION foo REFRESH PUBLICATION", context=PROCESS_UTILITY_QUERY, params=0x0, queryEnv=0x0, dest=0x55e28c4c5980 <spi_printtupDR>, qc=0x7ffe5de60840) at /home/andres/src/postgresql/src/backend/tcop/utility.c:1828
#7  0x000055e28c0204db in standard_ProcessUtility (pstmt=0x55e28d2e00f0, queryString=0x55e28d2bc448 "ALTER SUBSCRIPTION foo REFRESH PUBLICATION", context=PROCESS_UTILITY_QUERY, params=0x0, queryEnv=0x0, dest=0x55e28c4c5980 <spi_printtupDR>, qc=0x7ffe5de60840) at /home/andres/src/postgresql/src/backend/tcop/utility.c:1071
#8  0x000055e28c01f595 in ProcessUtility (pstmt=0x55e28d2e00f0, queryString=0x55e28d2bc448 "ALTER SUBSCRIPTION foo REFRESH PUBLICATION", context=PROCESS_UTILITY_QUERY, params=0x0, queryEnv=0x0, dest=0x55e28c4c5980 <spi_printtupDR>, qc=0x7ffe5de60840) at /home/andres/src/postgresql/src/backend/tcop/utility.c:524
#9  0x000055e28be3833e in _SPI_execute_plan (plan=0x55e28d2dfc50, paramLI=0x0, snapshot=0x0, crosscheck_snapshot=0x0, read_only=false, fire_triggers=true, tcount=0, caller_dest=0x0) at /home/andres/src/postgresql/src/backend/executor/spi.c:2443
#10 0x000055e28be34d29 in SPI_execute_plan_with_paramlist (plan=0x55e28d2dfc50, params=0x0, read_only=false, tcount=0) at /home/andres/src/postgresql/src/backend/executor/spi.c:578
#11 0x00007f223eb91611 in exec_stmt_execsql (estate=0x7ffe5de60ca0, stmt=0x55e28d2c7fa0) at /home/andres/src/postgresql/src/pl/plpgsql/src/pl_exec.c:4288
#12 0x00007f223eb8ccb6 in exec_stmts (estate=0x7ffe5de60ca0, stmts=0x55e28d2c7ff8) at /home/andres/src/postgresql/src/pl/plpgsql/src/pl_exec.c:2055
#13 0x00007f223eb8c88c in exec_stmt_block (estate=0x7ffe5de60ca0, block=0x55e28d2c80d0) at /home/andres/src/postgresql/src/pl/plpgsql/src/pl_exec.c:1906
#14 0x00007f223eb8c09c in exec_toplevel_block (estate=0x7ffe5de60ca0, block=0x55e28d2c80d0) at /home/andres/src/postgresql/src/pl/plpgsql/src/pl_exec.c:1604
#15 0x00007f223eb8abe9 in plpgsql_exec_trigger (func=0x55e28d1e60d8, trigdata=0x7ffe5de610c0) at /home/andres/src/postgresql/src/pl/plpgsql/src/pl_exec.c:1019
#16 0x00007f223eba4c46 in plpgsql_call_handler (fcinfo=0x7ffe5de61040) at /home/andres/src/postgresql/src/pl/plpgsql/src/pl_handler.c:256
#17 0x000055e28bda9d6b in ExecCallTriggerFunc (trigdata=0x7ffe5de610c0, tgindx=0, finfo=0x55e28d29cd30, instr=0x0, per_tuple_context=0x55e28d2b4120) at /home/andres/src/postgresql/src/backend/commands/trigger.c:2077
#18 0x000055e28bdaa2ed in ExecBRInsertTriggers (estate=0x55e28d2b2230, relinfo=0x55e28d29bc60, slot=0x55e28d29b858) at /home/andres/src/postgresql/src/backend/commands/trigger.c:2206
#19 0x000055e28bde890c in ExecSimpleRelationInsert (resultRelInfo=0x55e28d29bc60, estate=0x55e28d2b2230, slot=0x55e28d29b858) at /home/andres/src/postgresql/src/backend/executor/execReplication.c:422
#20 0x000055e28bf94f29 in apply_handle_insert_internal (relinfo=0x55e28d29bc60, estate=0x55e28d2b2230, remoteslot=0x55e28d29b858) at /home/andres/src/postgresql/src/backend/replication/logical/worker.c:1214
#21 0x000055e28bf94e9d in apply_handle_insert (s=0x7ffe5de61290) at /home/andres/src/postgresql/src/backend/replication/logical/worker.c:1190
#22 0x000055e28bf96659 in apply_dispatch (s=0x7ffe5de61290) at /home/andres/src/postgresql/src/backend/replication/logical/worker.c:1912
#23 0x000055e28bf96bab in LogicalRepApplyLoop (last_received=23489672) at /home/andres/src/postgresql/src/backend/replication/logical/worker.c:2150

It doesn't at all seem necessary for AlterSubscription_refresh() to use the global 'wrconn' variable. This code has basically survived unmodified since

commit 7c4f52409a8c7d85ed169bbbc1f6092274d03920
Author: Peter Eisentraut <peter_e@gmx.net>
Date:   2017-03-23 08:36:36 -0400

    Logical replication support for initial data copy

The crash is trivial to fix - simply use a local variable instead of wrconn. There's a few more places than what you hit here that need to be dealt with though. Obviously the code also ought to reset wrconn to NULL after freeing.

But uh, isn't this code also just plainly in danger of leaking libpq connections in case something goes wrong?

	/* Try to connect to the publisher. */
	conn = walrcv_connect(sub->conninfo, true, sub->name, &err);
	if (!conn)
		ereport(ERROR,
				(errmsg("could not connect to the publisher: %s", err)));

	/* Get the table list from publisher. */
	pubrel_names = fetch_table_list(conn, sub->publications);

	/* We are done with the remote side, close connection. */
	walrcv_disconnect(conn);

What if fetch_table_list() errors out? Oddly other cases seem to pay some attention to this, see e.g. DropSubscription().

Greetings,

Andres Freund
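As a rough sketch of the direction suggested above - a local connection variable plus an error-safe disconnect - and explicitly not the committed fix, the refresh path could look something like this; the reconciliation of pg_subscription_rel is elided:

	static void
	AlterSubscription_refresh(Subscription *sub, bool copy_data)
	{
		WalReceiverConn *conn;	/* local; leaves the apply worker's wrconn alone */
		List	   *pubrel_names;
		char	   *err;

		/* Try to connect to the publisher. */
		conn = walrcv_connect(sub->conninfo, true, sub->name, &err);
		if (!conn)
			ereport(ERROR,
					(errmsg("could not connect to the publisher: %s", err)));

		PG_TRY();
		{
			/* Get the table list from publisher. */
			pubrel_names = fetch_table_list(conn, sub->publications);

			/* ... reconcile pg_subscription_rel against pubrel_names ... */
		}
		PG_CATCH();
		{
			/* Don't leak the libpq connection if anything above errors out. */
			walrcv_disconnect(conn);
			PG_RE_THROW();
		}
		PG_END_TRY();

		/* We are done with the remote side, close connection. */
		walrcv_disconnect(conn);
	}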