Thread: Segfault logical replication PG 10.4

Segfault logical replication PG 10.4

From
Mai Peng
Date:

We discovered that our pg_wal partition was full a few days after setting up our first logical publication on a PG 10.4 instance.
After that we could not synchronise our slave with the master: it triggered a segfault on the slave. We had to drop the subscription on the slave and the slot on the master manually.
To find the cause of this bug we then stopped the connection between master and slave; after 30 minutes the slave segfaulted again and could not synchronise.
Why can the slave not synchronise, without recreating the subscription from scratch, after the slot has been dropped?
How should we manage the replication, given that we run on cloud VMs and have network latency issues?
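As a side note on the monitoring question: in PostgreSQL 10 a query along these lines (using the `pg_replication_slots` view and the `pg_current_wal_lsn()`/`pg_wal_lsn_diff()` functions) shows how much WAL each slot is holding back, so a stalled subscriber can be noticed before pg_wal fills up. This is only a sketch, not something from the original report:

```sql
-- Sketch: how much WAL each replication slot is retaining (PG 10+).
-- A slot whose restart_lsn stops advancing pins WAL files in pg_wal
-- until the partition fills up, which is what happened here.
SELECT slot_name,
       active,
       restart_lsn,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn))
         AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;
```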

Here are the details of the configuration and error logs:

Conf on master:
max_replication_slots = 10
max_sync_workers_per_subscription = 2
wal_receiver_timeout = 60s
wal_keep_segments = 1000
wal_receiver_status_interval = 10s
wal_retrieve_retry_interval = 5s
max_logical_replication_workers = 4

Conf on slave:
same, except wal_keep_segments = 0

Error log on slave:
LOG: logical replication apply worker for subscription "XXXX" has started
DEBUG: connecting to publisher using connection string "postgresql://USER@IP"
LOG: worker process: logical replication worker for subscription 132253 (PID 25359) was terminated by signal 11: Segmentation fault
LOG: terminating any other active server processes
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
LOG: all server processes terminated; reinitializing
DEBUG: unregistering background worker "logical replication worker for subscription 132253"
LOG: database system was interrupted; last known up at 2018-07-11 21:50:56 UTC
DEBUG: checkpoint record is at 0/7DBFEF10
DEBUG: redo record is at 0/7DBFEF10; shutdown TRUE
DEBUG: next transaction ID: 0:93714; next OID: 140237
DEBUG: next MultiXactId: 1; next MultiXactOffset: 0
DEBUG: oldest unfrozen transaction ID: 548, in database 1
DEBUG: oldest MultiXactId: 1, in database 1
DEBUG: commit timestamp Xid oldest/newest: 0/0
DEBUG: transaction ID wrap limit is 2147484195, limited by database with OID 1
DEBUG: MultiXactId wrap limit is 2147483648, limited by database with OID 1
DEBUG: starting up replication slots
LOG: recovered replication state of node 2 to 0/0
LOG: recovered replication state of node 3 to 0/0
LOG: recovered replication state of node 4 to 0/0
LOG: recovered replication state of node 5 to 56A5/29ACA918
LOG: database system was not properly shut down; automatic recovery in progress



THANK YOU

Re: Segfault logical replication PG 10.4

From
Peter Eisentraut
Date:
On 12.07.18 00:56, Mai Peng wrote:
> We discovered our pg_wal partition was full few days after setting our
> first logical publication on a PG 10.4 instance.
> Then, we can not synchronise our slave to the master, it triggers a
> segfault on the slave. We had to drop manually the subscription on slave
> and the slot on master.
> Then, we wanted to find the cause of this bug, stop connection between
> master and slave , after 30 minutes, the slave had a segfault and could
> not synchronise.
> Why does the slave can not synchronise without a complete creation
> subscription after dropping the slot?

Can you get a backtrace from the core dump produced by the segfault?

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Segfault logical replication PG 10.4

From
Mai Peng
Date:
Hello,
Here is the backtrace:

```
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: bgworker: logical replication worker for subscription'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  GetActiveSnapshot () at ./build/../src/backend/utils/time/snapmgr.c:843
843     ./build/../src/backend/utils/time/snapmgr.c: No such file or directory.
(gdb) up
#1  0x000056239fc83ecd in postquel_start (fcache=0x5623a1c83b78, es=0x5623a1c91a70)
    at ./build/../src/backend/executor/functions.c:809
809     ./build/../src/backend/executor/functions.c: No such file or directory.
(gdb) up
#2  fmgr_sql (fcinfo=0x5623a1c7af98) at ./build/../src/backend/executor/functions.c:1150
1150    in ./build/../src/backend/executor/functions.c
(gdb) up
#3  0x000056239fc75b4f in ExecInterpExpr (state=0x5623a1c7aeb8, econtext=0x5623a1c7b898, isnull=<optimized out>)
    at ./build/../src/backend/executor/execExprInterp.c:672
672     ./build/../src/backend/executor/execExprInterp.c: No such file or directory.
(gdb) up
#4  0x000056239fc72a96 in ExecEvalExprSwitchContext (isNull=0x7ffdc20d1487 "", econtext=0x5623a1c7b898,
    state=<optimized out>) at ./build/../src/include/executor/executor.h:308
308     ./build/../src/include/executor/executor.h: No such file or directory.
(gdb) up
#5  ExecCheck (state=<optimized out>, econtext=econtext@entry=0x5623a1c7b898)
    at ./build/../src/backend/executor/execExpr.c:556
556     ./build/../src/backend/executor/execExpr.c: No such file or directory.
(gdb) up
#6  0x000056239fdefec0 in domain_check_input (value=value@entry=94711038065448, isnull=0 '\000',
    my_extra=my_extra@entry=0x5623a1c79bd0) at ./build/../src/backend/utils/adt/domains.c:180
180     ./build/../src/backend/utils/adt/domains.c: No such file or directory.
(gdb) up
#7  0x000056239fdf0100 in domain_in (fcinfo=0x7ffdc20d1530) at ./build/../src/backend/utils/adt/domains.c:255
255     in ./build/../src/backend/utils/adt/domains.c
(gdb) up
#8  0x000056239feccd85 in InputFunctionCall (flinfo=flinfo@entry=0x7ffdc20d1900,
    str=str@entry=0x5623a1bf4698 "{test.Type.blabla}", typioparam=typioparam@entry=28498, typmod=typmod@entry=-1) at
./build/../src/backend/utils/fmgr/fmgr.c:1640
1640    ./build/../src/backend/utils/fmgr/fmgr.c: No such file or directory.
(gdb) up
#9  0x000056239fecd01d in OidInputFunctionCall (functionId=<optimized out>,
    str=0x5623a1bf4698 "{test.Type.blabla}",
    typioparam=28498, typmod=-1) at ./build/../src/backend/utils/fmgr/fmgr.c:1743
1743    in ./build/../src/backend/utils/fmgr/fmgr.c
(gdb) up
#10 0x000056239fd6429f in slot_store_cstrings (slot=slot@entry=0x5623a1c664e8, rel=rel@entry=0x5623a1c60bf0,
    values=values@entry=0x7ffdc20d1a70) at ./build/../src/backend/replication/logical/worker.c:354
354     ./build/../src/backend/replication/logical/worker.c: No such file or directory.
(gdb) up
#11 0x000056239fd64cc2 in apply_handle_insert (s=s@entry=0x7ffdc20d5580)
    at ./build/../src/backend/replication/logical/worker.c:612
612     in ./build/../src/backend/replication/logical/worker.c
(gdb) up
#12 0x000056239fd65e5b in apply_dispatch (s=0x7ffdc20d5580) at ./build/../src/backend/replication/logical/worker.c:912
912     in ./build/../src/backend/replication/logical/worker.c
(gdb) up
#13 LogicalRepApplyLoop (last_received=97628639619392) at ./build/../src/backend/replication/logical/worker.c:1119
1119    in ./build/../src/backend/replication/logical/worker.c
(gdb) up
#14 ApplyWorkerMain (main_arg=<optimized out>) at ./build/../src/backend/replication/logical/worker.c:1671
1671    in ./build/../src/backend/replication/logical/worker.c
(gdb) up
#15 0x000056239fd3540c in StartBackgroundWorker () at ./build/../src/backend/postmaster/bgworker.c:835
835     ./build/../src/backend/postmaster/bgworker.c: No such file or directory.
(gdb) up
#16 0x000056239fd42175 in do_start_bgworker (rw=0x5623a1b71870) at ./build/../src/backend/postmaster/postmaster.c:5728
5728    ./build/../src/backend/postmaster/postmaster.c: No such file or directory.
(gdb) up
#17 maybe_start_bgworkers () at ./build/../src/backend/postmaster/postmaster.c:5941
5941    in ./build/../src/backend/postmaster/postmaster.c
(gdb) up
#18 0x000056239fd42d35 in sigusr1_handler (postgres_signal_arg=<optimized out>)
    at ./build/../src/backend/postmaster/postmaster.c:5121
5121    in ./build/../src/backend/postmaster/postmaster.c
(gdb) up
#19 <signal handler called>
(gdb) up
#20 0x00007f533261a3a3 in __select_nocancel () at ../sysdeps/unix/syscall-template.S:84
84      ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) up
#21 0x000056239fae8c71 in ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1719
1719    ./build/../src/backend/postmaster/postmaster.c: No such file or directory.
(gdb) up
#22 0x000056239fd4407a in PostmasterMain (argc=3, argv=0x5623a1b49ef0)
    at ./build/../src/backend/postmaster/postmaster.c:1363
1363    in ./build/../src/backend/postmaster/postmaster.c
(gdb) up
#23 0x000056239faeac74 in main (argc=3, argv=0x5623a1b49ef0) at ./build/../src/backend/main/main.c:228
228     ./build/../src/backend/main/main.c: No such file or directory.
```

BR

> Le 14 juil. 2018 à 13:15, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> a écrit :
>
> On 12.07.18 00:56, Mai Peng wrote:
>> We discovered our pg_wal partition was full few days after setting our
>> first logical publication on a PG 10.4 instance.
>> Then, we can not synchronise our slave to the master, it triggers a
>> segfault on the slave. We had to drop manually the subscription on slave
>> and the slot on master.
>> Then, we wanted to find the cause of this bug, stop connection between
>> master and slave , after 30 minutes, the slave had a segfault and could
>> not synchronise.
>> Why does the slave can not synchronise without a complete creation
>> subscription after dropping the slot?
>
> Can you get a backtrace from the core dump produced by the segfault?
>
> --
> Peter Eisentraut              http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Segfault logical replication PG 10.4

From
Tom Lane
Date:
Mai Peng <maily.peng@webedia-group.com> writes:
> Here the backtrace

Hmm .. so this can be summarized as "logical replication workers should
provide an ActiveSnapshot in case the user functions they call want one".
Makes me wonder how much other transactional infrastructure is needed
but not present.

            regards, tom lane


Re: Segfault logical replication PG 10.4

From
Mai Peng
Date:
Hello,
Some new input:
On the slave, all domains (with checks) have been replaced by their plain base type. No crash on the slave since this workaround.
Is there something to fix in the ActiveSnapshot code?
BR
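For anyone applying the same workaround: the bypass described above amounts to swapping the domain for its underlying base type on the subscriber. With made-up names (the real table and domain are not shown in this message), it might look like:

```sql
-- Sketch of the workaround, with hypothetical names: replace a checked
-- domain column with its base type on the subscriber, so the apply
-- worker no longer evaluates the domain's CHECK expression.
ALTER TABLE my_schema.my_table
  ALTER COLUMN tags TYPE ltree[];   -- was the domain my_schema.tag_array

DROP DOMAIN my_schema.tag_array;    -- once nothing references it
```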


> Le 18 juil. 2018 à 17:03, Tom Lane <tgl@sss.pgh.pa.us> a écrit :
>
> Mai Peng <maily.peng@webedia-group.com> writes:
>> Here the backtrace
>
> Hmm .. so this can be summarized as "logical replication workers should
> provide an ActiveSnapshot in case the user functions they call want one".
> Makes me wonder how much other transactional infrastructure is needed
> but not present.
>
>             regards, tom lane



Re: Segfault logical replication PG 10.4

From
Quan TRAN
Date:
Hello,

In worker.c:

    - in apply_handle_insert, slot_store_cstrings is called before PushActiveSnapshot
    - in apply_handle_update & apply_handle_delete, slot_store_cstrings is called after PushActiveSnapshot

    /* Process and store remote tuple in the slot */
    oldctx = MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
    slot_store_cstrings(remoteslot, rel, newtup.values);
    slot_fill_defaults(rel, estate, remoteslot);
    MemoryContextSwitchTo(oldctx);

    PushActiveSnapshot(GetTransactionSnapshot());
Could this be the cause?

Thank you.

Re: Segfault logical replication PG 10.4

From
Alvaro Herrera
Date:
On 2018-Jul-19, Quan TRAN wrote:

> Hello,
> 
> In worker.c:
> 
>     - in apply_handle_insert, slot_store_cstrings is called before
> PushActiveSnapshot
>     - in apply_handle_update & apply_handle_delete, slot_store_cstrings is
> called after PushActiveSnapshot
> 
>         /* Process and store remote tuple in the slot */
>         oldctx = MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
>         slot_store_cstrings(remoteslot, rel, newtup.values);
>         slot_fill_defaults(rel, estate, remoteslot);
>         MemoryContextSwitchTo(oldctx);
> 
>         PushActiveSnapshot(GetTransactionSnapshot());
> 
> Should this be the cause?

Yeah, moving the PushActiveSnapshot call to just before
slot_store_cstrings sounds right.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Segfault logical replication PG 10.4

From
Minh-Quan Tran
Date:
Something like this?

Thank you.
Attachment

Re: Segfault logical replication PG 10.4

From
Alvaro Herrera
Date:
On 2018-Jul-20, Minh-Quan Tran wrote:

> Something like this?

Can you provide a reproducer for this problem?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Segfault logical replication PG 10.4

From
Minh-Quan Tran
Date:
Hello,

Here are the steps to reproduce this problem.

Thank you.

Pastebin: https://pastebin.com/Wc80YegY

--

CREATE EXTENSION IF NOT EXISTS postgis;
CREATE EXTENSION IF NOT EXISTS ltree;
CREATE EXTENSION IF NOT EXISTS intarray;
CREATE EXTENSION IF NOT EXISTS hstore;
CREATE EXTENSION IF NOT EXISTS unaccent;
CREATE EXTENSION IF NOT EXISTS pgcrypto;

CREATE SCHEMA IF NOT EXISTS itscaro;

CREATE DOMAIN public.tag3level AS ltree
  constraint tag3level_check CHECK (public.nlevel(VALUE) = 3);

CREATE OR REPLACE FUNCTION public.is_valid_array_tag3level(ltree[])
  RETURNS boolean AS
$BODY$

  SELECT NOT EXISTS (
    SELECT 1
     FROM unnest ($1) tag
     WHERE tag operator(public.<>) tag::public.tag3level
  )

$BODY$
  LANGUAGE sql IMMUTABLE STRICT;

CREATE DOMAIN tag3level_array AS ltree[] CHECK ((public.is_valid_array_tag3level(VALUE)));

CREATE TABLE IF NOT EXISTS itscaro.test
(
    id int,
    tag public.tag3level_array DEFAULT '{}'
);

insert into itscaro.test values (2, '{{test1.test2.test3}');
-- NOTE: initial data copy is done correctly

-- Pub/Sub

-- Master
CREATE PUBLICATION itscaro_test
    FOR TABLE itscaro.test;

-- Slave
--  On slave as this moment we have all domains, functions, extensions and testing table
CREATE SUBSCRIPTION itscaro_test
  CONNECTION 'postgresql://postgres@<FILL IN MASTER IP>/postgres'
  PUBLICATION itscaro_test;

-- Inserting new data after subscription is registered

insert into itscaro.test values (2, '{tefghjfgqst.qdqsdsq.qdqsdgfjhfghqs}'),
(3, '{teqghjghjgfst.qdqsdsq.qdqsdqssqdqsdqsd}'),
(4, '{tjgfjhghjgfjheqst.qdqqsdqsdjghjdsdsq.qdqsdqsqsdqsdqs}'),
(5, '{teqsqsdqsdqsdqsdqs.qdqsdsq.qdqsqsddazfdgfdhgfjhgfjhgdqs}'),
(6, '{teqdqqsdqsdqsdqsdqsst.qdqsdsq.qdqsdqs}'),
(7, '{teqst.qdqsdsq.qdqsdqsdqsdqsdqsdqsqs}'),
(8, '{teqsdqsdqssazdazdqdqsdqsdsqdqsdqst.qdqsdsq.qdqsdsdsqdqsdqsqs}'),
(9, '{teqqsdqsdqsdqsdqsdqsdqsdqsdqsst.qdqsdsq.qdqsdqsdqsdqsdqsqs}');

-- After executing the insert batch on master, failure occurs on slave
--
-- 2018-07-28 10:46:05.230 CEST [3132] LOG:  database system is ready to accept connections
-- 2018-07-28 10:46:05.240 CEST [3705] LOG:  logical replication apply worker for subscription "itscaro_test" has started
-- 2018-07-28 10:46:05.269 CEST [3132] LOG:  worker process: logical replication worker for subscription 275990 (PID 3705) was terminated by signal 11: Segmentation fault
-- 2018-07-28 10:46:05.269 CEST [3132] LOG:  terminating any other active server processes
-- 2018-07-28 10:46:05.274 CEST [3702] WARNING:  terminating connection because of crash of another server process

Re: Segfault logical replication PG 10.4

From
Quan Tran
Date:
Sorry, there is an error in the previous message:

insert into itscaro.test values (2, '{{test1.test2.test3}');

should be corrected with:

insert into itscaro.test values (2, '{test1.test2.test3}');

Re: Segfault logical replication PG 10.4

From
Alvaro Herrera
Date:
Aha, I see, thanks.  Here's a complete fix with an included test case.  In
an unpatched assert-enabled build, this crashes with:

TRAP: FailedAssertion("!(ActiveSnapshotSet())", File: "/pgsql/source/REL_10_STABLE/src/backend/tcop/postgres.c", Line: 788)

Will push on Monday.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Segfault logical replication PG 10.4

From
Alvaro Herrera
Date:
On 2018-Jul-28, Alvaro Herrera wrote:

> Aha, I see, thanks.  Here's a complete fix with included testcase.  In
> an unpatched assert-enabled build, this crashes this
> 
> TRAP: FailedAssertion("!(ActiveSnapshotSet())", File: "/pgsql/source/REL_10_STABLE/src/backend/tcop/postgres.c", Line: 788)
> 
> Will push on Monday.

Pushed after changing the constraint in the test case to be less silly.
Thanks for the report and diagnosis.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Segfault logical replication PG 10.4

From
Mai Peng
Date:
Thank you Alvaro :)

> Le 30 juil. 2018 à 22:33, Alvaro Herrera <alvherre@2ndquadrant.com> a écrit :
>
> On 2018-Jul-28, Alvaro Herrera wrote:
>
>> Aha, I see, thanks.  Here's a complete fix with included testcase.  In
>> an unpatched assert-enabled build, this crashes this
>>
>> TRAP: FailedAssertion("!(ActiveSnapshotSet())", File: "/pgsql/source/REL_10_STABLE/src/backend/tcop/postgres.c", Line: 788)
>>
>> Will push on Monday.
>
> Pushed after changing the constraint in the test case to be less silly.
> Thanks for the report and diagnosis.
>
> --
> Álvaro Herrera                https://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services