Home > mailing lists

Re: Logical replication existing data copy - Mailing list pgsql-hackers

From	Petr Jelinek
Subject	Re: Logical replication existing data copy
Date	March 29, 2017 16:06:41
Msg-id	1f810a1c-d57b-e8af-44cb-2e55defb30f8@2ndquadrant.com Whole thread Raw
In response to	Re: Logical replication existing data copy (Erik Rijkers <er@xs4all.nl>)
List	pgsql-hackers

Tree view

On 29/03/17 10:14, Erik Rijkers wrote:
> On 2017-03-09 11:06, Erik Rijkers wrote:
>>>>
>>>> I use three different machines (2 desktop, 1 server) to test logical
>>>> replication, and all three have now at least once failed to correctly
>>>> synchronise a pgbench session (amidst many succesful runs, of course)
>>>
> 
> 
> (At the moment using tese patches for tests:)
> 
>> 0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch +
>> 0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch+
>> 0003-Prevent-snapshot-builder-xmin-from-going-backwards.patch  +
>> 0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patch      +
>> 0005-Skip-unnecessary-snapshot-builds.patch                    +
> 
> 
> The failed tests that I kept seeing (see the
> pgbench-over-logical-replication tests upthread) were never really
> 'solved'.
> 
> 
> But I have now finally figured out what caused these unexpected failed
> tests: it was  wal_sender_timeout  or rather, its default of 60 s.
> 
> This caused 'terminating walsender process due to replication timeout'
> on the primary (not strictly an error), and the concomittant ERROR on
> the replica: 'could not receive data from WAL stream: server closed the
> connection unexpectedly'.
> 
> here is a typical example (primary/replica logs time-intertwined, with
> 'primary'):
> 
> [...]
> 2017-03-24 16:21:38.129 CET [15002]  primary    LOG:  using stale
> statistics instead of current ones because stats collector is not
> responding
> 2017-03-24 16:21:42.690 CET [27515]  primary    LOG:  using stale
> statistics instead of current ones because stats collector is not
> responding
> 2017-03-24 16:21:42.965 CET [14999]    replica  LOG:  using stale
> statistics instead of current ones because stats collector is not
> responding
> 2017-03-24 16:21:49.816 CET [14930]  primary    LOG:  terminating
> walsender process due to
> 2017-03-24 16:21:49.817 CET [14926]    replica  ERROR:  could not
> receive data from WAL stream: server closed the connection unexpectedly
> 2017-03-24 16:21:49.824 CET [27502]    replica  LOG:  worker process:
> logical replication worker for subscription 24864 (PID 14926) exited
> with exit code 1
> 2017-03-24 16:21:49.824 CET [27521]    replica  LOG:  starting logical
> replication worker for subscription "sub1"
> 2017-03-24 16:21:49.828 CET [15008]    replica  LOG:  logical
> replication apply for subscription sub1 started
> 2017-03-24 16:21:49.832 CET [15009]  primary    LOG:  received
> replication command: IDENTIFY_SYSTEM
> 2017-03-24 16:21:49.832 CET [15009]  primary    LOG:  received
> replication command: START_REPLICATION SLOT "sub1" LOGICAL 3/FC976440
> (proto_version '1', publication_names '"pub1"')
> 2017-03-24 16:21:49.833 CET [15009]  primary    DETAIL:  streaming
> transactions committing after 3/FC889810, reading WAL from 3/FC820FC0
> 2017-03-24 16:21:49.833 CET [15009]  primary    LOG:  starting logical
> decoding for slot "sub1"
> 2017-03-24 16:21:50.471 CET [15009]  primary    DETAIL:  Logical
> decoding will begin using saved snapshot.
> 2017-03-24 16:21:50.471 CET [15009]  primary    LOG:  logical decoding
> found consistent point at 3/FC820FC0
> 2017-03-24 16:21:51.169 CET [15008]    replica  DETAIL:  Key
> (hid)=(9014) already exists.
> 2017-03-24 16:21:51.169 CET [15008]    replica  ERROR:  duplicate key
> value violates unique constraint "pgbench_history_pkey"
> 2017-03-24 16:21:51.170 CET [27502]    replica  LOG:  worker process:
> logical replication worker for subscription 24864 (PID 15008) exited
> with exit code 1
> 2017-03-24 16:21:51.170 CET [27521]    replica  LOG:  starting logical
> replication worker for subscription "sub1"
> [...]
> 
> My primary and replica were always on a single machine (making it more
> likely that that timeout is reached?).  In my testing it seems that
> reaching the timeout on the primary (and 'closing the connection
> unexpectedly' on the replica) does not necessarily break the logical
> replication.  But almost all log-rep failures that I have seen were
> started by this sequence of events.
> 
> After setting  wal_sender_timeout  to 3 minutes there were no more
> failed tests.
> 
> Perhaps it warrants setting  wal_sender_timeout  a bit higher than the
> current default of 60 seconds?  After all I also saw the 'replication
> timeout' / 'closed the connection' couple rather often during
> not-failing tests.  (These also disappeared, almost completely,  with a
> higher  setting of wal_sender_timeout)
> 
> In any case it would be good to mention the setting (and its potentially
> deteriorating effect) somehere nearer the logical replication treatment.
> 
> ( I read about wal_sender_timeout and keepalive ping, perhaps there's
> (still) something amiss there? Just a guess, I don't know )
> 
> As I said, I saw no more failures with the higher 3 minute setting, with
> one exception: the one test that straddled the DST change (saterday 24
> march 02:00 h).  I am happy to discount that one failure but strictly
> speaking I suppose it should be able to take DST into its stride.
> 
> 

Hi,

I think what you have seen is because of this:

https://www.postgresql.org/message-id/flat/b235fa69-147a-5e09-f8f3-3f780a1abb2e@2ndquadrant.com#b235fa69-147a-5e09-f8f3-3f780a1abb2e@2ndquadrant.com

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services

pgsql-hackers by date:

From: Nikolay Shaplov
Date: 29 March 2017, 15:51:43
Subject: Re: [PATCH v.7] Move all am-related reloption code into src/backend/access/[am-name] and get rid of relopt_kind for custom AM

From: Michael Paquier
Date: 29 March 2017, 16:08:03
Subject: Re: Allow interrupts on waiting standby

Re: Logical replication existing data copy - Mailing list pgsql-hackers

Previous

Next