Re: Logical replication existing data copy - Mailing list pgsql-hackers

From Erik Rijkers
Subject Re: Logical replication existing data copy
Date
Msg-id 03a1305e84a17ad729f4075fcf3f4767@xs4all.nl
Whole thread Raw
In response to Re: [HACKERS] Logical replication existing data copy  (Erik Rijkers <er@xs4all.nl>)
Responses Re: Logical replication existing data copy
List pgsql-hackers
On 2017-03-09 11:06, Erik Rijkers wrote:
>>> 
>>> I use three different machines (2 desktop, 1 server) to test logical
>>> replication, and all three have now at least once failed to correctly
>>> synchronise a pgbench session (amidst many succesful runs, of course)
>> 


(At the moment using tese patches for tests:)

> 0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch +
> 0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch+
> 0003-Prevent-snapshot-builder-xmin-from-going-backwards.patch  +
> 0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patch      +
> 0005-Skip-unnecessary-snapshot-builds.patch                    +


The failed tests that I kept seeing (see the 
pgbench-over-logical-replication tests upthread) were never really 
'solved'.


But I have now finally figured out what caused these unexpected failed 
tests: it was  wal_sender_timeout  or rather, its default of 60 s.

This caused 'terminating walsender process due to replication timeout' 
on the primary (not strictly an error), and the concomittant ERROR on 
the replica: 'could not receive data from WAL stream: server closed the 
connection unexpectedly'.

here is a typical example (primary/replica logs time-intertwined, with 
'primary'):

[...]
2017-03-24 16:21:38.129 CET [15002]  primary    LOG:  using stale 
statistics instead of current ones because stats collector is not 
responding
2017-03-24 16:21:42.690 CET [27515]  primary    LOG:  using stale 
statistics instead of current ones because stats collector is not 
responding
2017-03-24 16:21:42.965 CET [14999]    replica  LOG:  using stale 
statistics instead of current ones because stats collector is not 
responding
2017-03-24 16:21:49.816 CET [14930]  primary    LOG:  terminating 
walsender process due to
2017-03-24 16:21:49.817 CET [14926]    replica  ERROR:  could not 
receive data from WAL stream: server closed the connection unexpectedly
2017-03-24 16:21:49.824 CET [27502]    replica  LOG:  worker process: 
logical replication worker for subscription 24864 (PID 14926) exited 
with exit code 1
2017-03-24 16:21:49.824 CET [27521]    replica  LOG:  starting logical 
replication worker for subscription "sub1"
2017-03-24 16:21:49.828 CET [15008]    replica  LOG:  logical 
replication apply for subscription sub1 started
2017-03-24 16:21:49.832 CET [15009]  primary    LOG:  received 
replication command: IDENTIFY_SYSTEM
2017-03-24 16:21:49.832 CET [15009]  primary    LOG:  received 
replication command: START_REPLICATION SLOT "sub1" LOGICAL 3/FC976440 
(proto_version '1', publication_names '"pub1"')
2017-03-24 16:21:49.833 CET [15009]  primary    DETAIL:  streaming 
transactions committing after 3/FC889810, reading WAL from 3/FC820FC0
2017-03-24 16:21:49.833 CET [15009]  primary    LOG:  starting logical 
decoding for slot "sub1"
2017-03-24 16:21:50.471 CET [15009]  primary    DETAIL:  Logical 
decoding will begin using saved snapshot.
2017-03-24 16:21:50.471 CET [15009]  primary    LOG:  logical decoding 
found consistent point at 3/FC820FC0
2017-03-24 16:21:51.169 CET [15008]    replica  DETAIL:  Key 
(hid)=(9014) already exists.
2017-03-24 16:21:51.169 CET [15008]    replica  ERROR:  duplicate key 
value violates unique constraint "pgbench_history_pkey"
2017-03-24 16:21:51.170 CET [27502]    replica  LOG:  worker process: 
logical replication worker for subscription 24864 (PID 15008) exited 
with exit code 1
2017-03-24 16:21:51.170 CET [27521]    replica  LOG:  starting logical 
replication worker for subscription "sub1"
[...]

My primary and replica were always on a single machine (making it more 
likely that that timeout is reached?).  In my testing it seems that 
reaching the timeout on the primary (and 'closing the connection 
unexpectedly' on the replica) does not necessarily break the logical 
replication.  But almost all log-rep failures that I have seen were 
started by this sequence of events.

After setting  wal_sender_timeout  to 3 minutes there were no more 
failed tests.

Perhaps it warrants setting  wal_sender_timeout  a bit higher than the 
current default of 60 seconds?  After all I also saw the 'replication 
timeout' / 'closed the connection' couple rather often during 
not-failing tests.  (These also disappeared, almost completely,  with a 
higher  setting of wal_sender_timeout)

In any case it would be good to mention the setting (and its potentially 
deteriorating effect) somehere nearer the logical replication treatment.

( I read about wal_sender_timeout and keepalive ping, perhaps there's 
(still) something amiss there? Just a guess, I don't know )

As I said, I saw no more failures with the higher 3 minute setting, with 
one exception: the one test that straddled the DST change (saterday 24 
march 02:00 h).  I am happy to discount that one failure but strictly 
speaking I suppose it should be able to take DST into its stride.


Thanks,

Erik Rijkers












pgsql-hackers by date:

Previous
From: Alexander Law
Date:
Subject: [PATCH] Remove trailing spaces
Next
From: Amit Langote
Date:
Subject: Re: Partitioned tables and relfilenode