Thread: Segfault logical replication PG 10.4
We discovered our pg_wal partition was full a few days after setting up our first logical publication on a PG 10.4 instance.
Since then we cannot synchronise our slave with the master: synchronisation triggers a segfault on the slave. We had to manually drop the subscription on the slave and the slot on the master.
To find the cause of this bug, we then stopped the connection between master and slave; after 30 minutes, the slave segfaulted and could not synchronise.
Why can the slave not synchronise after dropping the slot, without recreating the subscription from scratch?
How should we manage the replication, given that we run on cloud VMs and have network latency issues?
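A stalled or abandoned logical replication slot prevents the master from recycling WAL, which is the usual way pg_wal fills up after creating a publication. As a rough sketch (standard PG 10 views and functions; the slot name 'my_sub' below is a made-up example, not taken from this setup), the WAL retained per slot can be checked and an orphaned slot dropped like this:

```sql
-- Sketch: on the master, see how much WAL each replication slot is still holding back.
SELECT slot_name,
       slot_type,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;

-- An orphaned slot (e.g. left behind after detaching a subscription with
-- ALTER SUBSCRIPTION ... SET (slot_name = NONE)) keeps retaining WAL until dropped.
-- 'my_sub' is a hypothetical slot name used for illustration only.
SELECT pg_drop_replication_slot('my_sub');
```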
Here are the configuration details and the error log:
Configuration on master:
max_replication_slots = 10
max_sync_workers_per_subscription = 2
wal_receiver_timeout = 60s
wal_keep_segments = 1000
wal_receiver_status_interval = 10
wal_retrieve_retry_interval = 5s
max_logical_replication_workers = 4

Configuration on slave:
same, except wal_keep_segments = 0
Error log on slave:
LOG: logical replication apply worker for subscription "XXXX" has started
DEBUG: connecting to publisher using connection string "postgresql://USER@IP"
LOG: worker process: logical replication worker for subscription 132253 (PID 25359) was terminated by signal 11: Segmentation fault
LOG: terminating any other active server processes
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
LOG: all server processes terminated; reinitializing
DEBUG: unregistering background worker "logical replication worker for subscription 132253"
LOG: database system was interrupted; last known up at 2018-07-11 21:50:56 UTC
DEBUG: checkpoint record is at 0/7DBFEF10
DEBUG: redo record is at 0/7DBFEF10; shutdown TRUE
DEBUG: next transaction ID: 0:93714; next OID: 140237
DEBUG: next MultiXactId: 1; next MultiXactOffset: 0
DEBUG: oldest unfrozen transaction ID: 548, in database 1
DEBUG: oldest MultiXactId: 1, in database 1
DEBUG: commit timestamp Xid oldest/newest: 0/0
DEBUG: transaction ID wrap limit is 2147484195, limited by database with OID 1
DEBUG: MultiXactId wrap limit is 2147483648, limited by database with OID 1
DEBUG: starting up replication slots
LOG: recovered replication state of node 2 to 0/0
LOG: recovered replication state of node 3 to 0/0
LOG: recovered replication state of node 4 to 0/0
LOG: recovered replication state of node 5 to 56A5/29ACA918
LOG: database system was not properly shut down; automatic recovery in progress
On 12.07.18 00:56, Mai Peng wrote:
> We discovered our pg_wal partition was full few days after setting our
> first logical publication on a PG 10.4 instance.
> Then, we can not synchronise our slave to the master, it triggers a
> segfault on the slave. We had to drop manually the subscription on slave
> and the slot on master.
> Then, we wanted to find the cause of this bug, stop connection between
> master and slave, after 30 minutes, the slave had a segfault and could
> not synchronise.
> Why does the slave can not synchronise without a complete creation
> subscription after dropping the slot?

Can you get a backtrace from the core dump produced by the segfault?

--
Peter Eisentraut                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hello,

Here is the backtrace:

```
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: bgworker: logical replication worker for subscription'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  GetActiveSnapshot () at ./build/../src/backend/utils/time/snapmgr.c:843
#1  0x000056239fc83ecd in postquel_start (fcache=0x5623a1c83b78, es=0x5623a1c91a70)
    at ./build/../src/backend/executor/functions.c:809
#2  fmgr_sql (fcinfo=0x5623a1c7af98) at ./build/../src/backend/executor/functions.c:1150
#3  0x000056239fc75b4f in ExecInterpExpr (state=0x5623a1c7aeb8, econtext=0x5623a1c7b898, isnull=<optimized out>)
    at ./build/../src/backend/executor/execExprInterp.c:672
#4  0x000056239fc72a96 in ExecEvalExprSwitchContext (isNull=0x7ffdc20d1487 "", econtext=0x5623a1c7b898, state=<optimized out>)
    at ./build/../src/include/executor/executor.h:308
#5  ExecCheck (state=<optimized out>, econtext=econtext@entry=0x5623a1c7b898)
    at ./build/../src/backend/executor/execExpr.c:556
#6  0x000056239fdefec0 in domain_check_input (value=value@entry=94711038065448, isnull=0 '\000', my_extra=my_extra@entry=0x5623a1c79bd0)
    at ./build/../src/backend/utils/adt/domains.c:180
#7  0x000056239fdf0100 in domain_in (fcinfo=0x7ffdc20d1530) at ./build/../src/backend/utils/adt/domains.c:255
#8  0x000056239feccd85 in InputFunctionCall (flinfo=flinfo@entry=0x7ffdc20d1900, str=str@entry=0x5623a1bf4698 "{test.Type.blabla}", typioparam=typioparam@entry=28498, typmod=typmod@entry=-1)
    at ./build/../src/backend/utils/fmgr/fmgr.c:1640
#9  0x000056239fecd01d in OidInputFunctionCall (functionId=<optimized out>, str=0x5623a1bf4698 "{test.Type.blabla}", typioparam=28498, typmod=-1)
    at ./build/../src/backend/utils/fmgr/fmgr.c:1743
#10 0x000056239fd6429f in slot_store_cstrings (slot=slot@entry=0x5623a1c664e8, rel=rel@entry=0x5623a1c60bf0, values=values@entry=0x7ffdc20d1a70)
    at ./build/../src/backend/replication/logical/worker.c:354
#11 0x000056239fd64cc2 in apply_handle_insert (s=s@entry=0x7ffdc20d5580)
    at ./build/../src/backend/replication/logical/worker.c:612
#12 0x000056239fd65e5b in apply_dispatch (s=0x7ffdc20d5580)
    at ./build/../src/backend/replication/logical/worker.c:912
#13 LogicalRepApplyLoop (last_received=97628639619392)
    at ./build/../src/backend/replication/logical/worker.c:1119
#14 ApplyWorkerMain (main_arg=<optimized out>)
    at ./build/../src/backend/replication/logical/worker.c:1671
#15 0x000056239fd3540c in StartBackgroundWorker ()
    at ./build/../src/backend/postmaster/bgworker.c:835
#16 0x000056239fd42175 in do_start_bgworker (rw=0x5623a1b71870)
    at ./build/../src/backend/postmaster/postmaster.c:5728
#17 maybe_start_bgworkers () at ./build/../src/backend/postmaster/postmaster.c:5941
#18 0x000056239fd42d35 in sigusr1_handler (postgres_signal_arg=<optimized out>)
    at ./build/../src/backend/postmaster/postmaster.c:5121
#19 <signal handler called>
#20 0x00007f533261a3a3 in __select_nocancel () at ../sysdeps/unix/syscall-template.S:84
#21 0x000056239fae8c71 in ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1719
#22 0x000056239fd4407a in PostmasterMain (argc=3, argv=0x5623a1b49ef0)
    at ./build/../src/backend/postmaster/postmaster.c:1363
#23 0x000056239faeac74 in main (argc=3, argv=0x5623a1b49ef0) at ./build/../src/backend/main/main.c:228
```

BR

> On 14 Jul 2018, at 13:15, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
>
> On 12.07.18 00:56, Mai Peng wrote:
>> We discovered our pg_wal partition was full few days after setting our
>> first logical publication on a PG 10.4 instance.
>> Then, we can not synchronise our slave to the master, it triggers a
>> segfault on the slave. We had to drop manually the subscription on slave
>> and the slot on master.
>> Then, we wanted to find the cause of this bug, stop connection between
>> master and slave, after 30 minutes, the slave had a segfault and could
>> not synchronise.
>> Why does the slave can not synchronise without a complete creation
>> subscription after dropping the slot?
>
> Can you get a backtrace from the core dump produced by the segfault?
>
> --
> Peter Eisentraut                  http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Mai Peng <maily.peng@webedia-group.com> writes:
> Here the backtrace

Hmm ... so this can be summarized as "logical replication workers should
provide an ActiveSnapshot in case the user functions they call want one".
Makes me wonder how much other transactional infrastructure is needed
but not present.

			regards, tom lane
Hello,

Some new input: on the slave, all domains (with checks) have been replaced by a simple type. There has been no crash on the slave since this bypass. Is there something to fix in the ActiveSnapshot code?

BR

> On 18 Jul 2018, at 17:03, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Mai Peng <maily.peng@webedia-group.com> writes:
>> Here the backtrace
>
> Hmm ... so this can be summarized as "logical replication workers should
> provide an ActiveSnapshot in case the user functions they call want one".
> Makes me wonder how much other transactional infrastructure is needed
> but not present.
>
> 			regards, tom lane
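A minimal sketch of the kind of bypass described above, replacing a checked-domain column with its underlying base type on the subscriber; the schema, table, column, and base type names here are hypothetical, not taken from this report:

```sql
-- Hypothetical example: column "tag" was declared with a domain over ltree[]
-- carrying a CHECK constraint. Switching it to the bare base type on the
-- subscriber means the apply worker no longer evaluates the domain check.
ALTER TABLE myschema.mytable
    ALTER COLUMN tag TYPE ltree[];
```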
In worker.c:
- in apply_handle_insert, slot_store_cstrings is called before PushActiveSnapshot
- in apply_handle_update & apply_handle_delete, slot_store_cstrings is called after PushActiveSnapshot
    /* Process and store remote tuple in the slot */
    oldctx = MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
    slot_store_cstrings(remoteslot, rel, newtup.values);
    slot_fill_defaults(rel, estate, remoteslot);
    MemoryContextSwitchTo(oldctx);

    PushActiveSnapshot(GetTransactionSnapshot());

Should this be the cause?
Thank you.
On 2018-Jul-19, Quan TRAN wrote:

> Hello,
>
> In worker.c:
>
> - in apply_handle_insert, slot_store_cstrings is called before
>   PushActiveSnapshot
> - in apply_handle_update & apply_handle_delete, slot_store_cstrings is
>   called after PushActiveSnapshot
>
>     /* Process and store remote tuple in the slot */
>     oldctx = MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
>     slot_store_cstrings(remoteslot, rel, newtup.values);
>     slot_fill_defaults(rel, estate, remoteslot);
>     MemoryContextSwitchTo(oldctx);
>
>     PushActiveSnapshot(GetTransactionSnapshot());
>
> Should this be the cause?

Yeah, moving the PushActiveSnapshot call to just before
slot_store_cstrings sounds right.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
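For readers following the fix, here is a sketch of the reordering being agreed on, shown as an abbreviated fragment of apply_handle_insert() in src/backend/replication/logical/worker.c; it illustrates the idea only and is not the exact patch that was eventually committed:

```c
/* Abbreviated fragment of apply_handle_insert(); earlier and later steps omitted. */

/*
 * Take a snapshot *before* running the input functions: a domain with a
 * CHECK constraint may execute SQL while the remote tuple is being parsed,
 * and that SQL needs an active snapshot (this is what crashed above).
 */
PushActiveSnapshot(GetTransactionSnapshot());

/* Process and store remote tuple in the slot */
oldctx = MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
slot_store_cstrings(remoteslot, rel, newtup.values);
slot_fill_defaults(rel, estate, remoteslot);
MemoryContextSwitchTo(oldctx);
```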
Attachment
On 2018-Jul-20, Minh-Quan Tran wrote:

> Something like this?

Can you provide a reproducer for this problem?

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Pastebin : https://pastebin.com/Wc80YegY
--
CREATE EXTENSION IF NOT EXISTS postgis;
CREATE EXTENSION IF NOT EXISTS ltree;
CREATE EXTENSION IF NOT EXISTS intarray;
CREATE EXTENSION IF NOT EXISTS hstore;
CREATE EXTENSION IF NOT EXISTS unaccent;
CREATE EXTENSION IF NOT EXISTS pgcrypto;
CREATE SCHEMA IF NOT EXISTS itscaro;
CREATE DOMAIN public.tag3level AS ltree
constraint tag3level_check CHECK (public.nlevel(VALUE) = 3);
CREATE OR REPLACE FUNCTION public.is_valid_array_tag3level(ltree[])
RETURNS boolean AS
$BODY$
SELECT NOT EXISTS (
SELECT 1
FROM unnest ($1) tag
WHERE tag operator(public.<>) tag::public.tag3level
)
$BODY$
LANGUAGE sql IMMUTABLE STRICT;
CREATE DOMAIN tag3level_array AS ltree[] CHECK ((public.is_valid_array_tag3level(VALUE)));
CREATE TABLE IF NOT EXISTS itscaro.test
(
id int,
tag public.tag3level_array DEFAULT '{}'
);
insert into itscaro.test values (2, '{test1.test2.test3}');
-- NOTE: initial data copy is done correctly
-- Pub/Sub
-- Master
CREATE PUBLICATION itscaro_test
FOR TABLE itscaro.test;
-- Slave
-- On the slave at this moment we have all domains, functions, extensions and the test table
CREATE SUBSCRIPTION itscaro_test
CONNECTION 'postgresql://postgres@<FILL IN MASTER IP>/postgres'
PUBLICATION itscaro_test;
-- Inserting new data after subscription is registered
insert into itscaro.test values (2, '{tefghjfgqst.qdqsdsq.qdqsdgfjhfghqs}'),
(3, '{teqghjghjgfst.qdqsdsq.qdqsdqssqdqsdqsd}'),
(4, '{tjgfjhghjgfjheqst.qdqqsdqsdjghjdsdsq.qdqsdqsqsdqsdqs}'),
(5, '{teqsqsdqsdqsdqsdqs.qdqsdsq.qdqsqsddazfdgfdhgfjhgfjhgdqs}'),
(6, '{teqdqqsdqsdqsdqsdqsst.qdqsdsq.qdqsdqs}'),
(7, '{teqst.qdqsdsq.qdqsdqsdqsdqsdqsdqsqs}'),
(8, '{teqsdqsdqssazdazdqdqsdqsdsqdqsdqst.qdqsdsq.qdqsdsdsqdqsdqsqs}'),
(9, '{teqqsdqsdqsdqsdqsdqsdqsdqsdqsst.qdqsdsq.qdqsdqsdqsdqsdqsqs}');
-- After executing the insert batch on master, failure occurs on slave
--
-- 2018-07-28 10:46:05.230 CEST [3132] LOG: database system is ready to accept connections
-- 2018-07-28 10:46:05.240 CEST [3705] LOG: logical replication apply worker for subscription "itscaro_test" has started
-- 2018-07-28 10:46:05.269 CEST [3132] LOG: worker process: logical replication worker for subscription 275990 (PID 3705) was terminated by signal 11: Segmentation fault
-- 2018-07-28 10:46:05.269 CEST [3132] LOG: terminating any other active server processes
-- 2018-07-28 10:46:05.274 CEST [3702] WARNING: terminating connection because of crash of another server process
Aha, I see, thanks.  Here's a complete fix with included testcase.  In an unpatched assert-enabled build, this crashes with

TRAP: FailedAssertion("!(ActiveSnapshotSet())", File: "/pgsql/source/REL_10_STABLE/src/backend/tcop/postgres.c", Line: 788)

Will push on Monday.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On 2018-Jul-28, Alvaro Herrera wrote:

> Aha, I see, thanks.  Here's a complete fix with included testcase.  In
> an unpatched assert-enabled build, this crashes with
>
> TRAP: FailedAssertion("!(ActiveSnapshotSet())", File: "/pgsql/source/REL_10_STABLE/src/backend/tcop/postgres.c", Line: 788)
>
> Will push on Monday.

Pushed after changing the constraint in the test case to be less silly.
Thanks for the report and diagnosis.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Thank you Alvaro :)

> On 30 Jul 2018, at 22:33, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> On 2018-Jul-28, Alvaro Herrera wrote:
>
>> Aha, I see, thanks.  Here's a complete fix with included testcase.  In
>> an unpatched assert-enabled build, this crashes with
>>
>> TRAP: FailedAssertion("!(ActiveSnapshotSet())", File: "/pgsql/source/REL_10_STABLE/src/backend/tcop/postgres.c", Line: 788)
>>
>> Will push on Monday.
>
> Pushed after changing the constraint in the test case to be less silly.
> Thanks for the report and diagnosis.
>
> --
> Álvaro Herrera                https://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services