Re: Logical replication existing data copy - Mailing list pgsql-hackers
From | Petr Jelinek |
---|---|
Subject | Re: Logical replication existing data copy |
Date | |
Msg-id | 1f810a1c-d57b-e8af-44cb-2e55defb30f8@2ndquadrant.com Whole thread Raw |
In response to | Re: Logical replication existing data copy (Erik Rijkers <er@xs4all.nl>) |
List | pgsql-hackers |
On 29/03/17 10:14, Erik Rijkers wrote: > On 2017-03-09 11:06, Erik Rijkers wrote: >>>> >>>> I use three different machines (2 desktop, 1 server) to test logical >>>> replication, and all three have now at least once failed to correctly >>>> synchronise a pgbench session (amidst many succesful runs, of course) >>> > > > (At the moment using tese patches for tests:) > >> 0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch + >> 0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch+ >> 0003-Prevent-snapshot-builder-xmin-from-going-backwards.patch + >> 0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patch + >> 0005-Skip-unnecessary-snapshot-builds.patch + > > > The failed tests that I kept seeing (see the > pgbench-over-logical-replication tests upthread) were never really > 'solved'. > > > But I have now finally figured out what caused these unexpected failed > tests: it was wal_sender_timeout or rather, its default of 60 s. > > This caused 'terminating walsender process due to replication timeout' > on the primary (not strictly an error), and the concomittant ERROR on > the replica: 'could not receive data from WAL stream: server closed the > connection unexpectedly'. > > here is a typical example (primary/replica logs time-intertwined, with > 'primary'): > > [...] > 2017-03-24 16:21:38.129 CET [15002] primary LOG: using stale > statistics instead of current ones because stats collector is not > responding > 2017-03-24 16:21:42.690 CET [27515] primary LOG: using stale > statistics instead of current ones because stats collector is not > responding > 2017-03-24 16:21:42.965 CET [14999] replica LOG: using stale > statistics instead of current ones because stats collector is not > responding > 2017-03-24 16:21:49.816 CET [14930] primary LOG: terminating > walsender process due to > 2017-03-24 16:21:49.817 CET [14926] replica ERROR: could not > receive data from WAL stream: server closed the connection unexpectedly > 2017-03-24 16:21:49.824 CET [27502] replica LOG: worker process: > logical replication worker for subscription 24864 (PID 14926) exited > with exit code 1 > 2017-03-24 16:21:49.824 CET [27521] replica LOG: starting logical > replication worker for subscription "sub1" > 2017-03-24 16:21:49.828 CET [15008] replica LOG: logical > replication apply for subscription sub1 started > 2017-03-24 16:21:49.832 CET [15009] primary LOG: received > replication command: IDENTIFY_SYSTEM > 2017-03-24 16:21:49.832 CET [15009] primary LOG: received > replication command: START_REPLICATION SLOT "sub1" LOGICAL 3/FC976440 > (proto_version '1', publication_names '"pub1"') > 2017-03-24 16:21:49.833 CET [15009] primary DETAIL: streaming > transactions committing after 3/FC889810, reading WAL from 3/FC820FC0 > 2017-03-24 16:21:49.833 CET [15009] primary LOG: starting logical > decoding for slot "sub1" > 2017-03-24 16:21:50.471 CET [15009] primary DETAIL: Logical > decoding will begin using saved snapshot. > 2017-03-24 16:21:50.471 CET [15009] primary LOG: logical decoding > found consistent point at 3/FC820FC0 > 2017-03-24 16:21:51.169 CET [15008] replica DETAIL: Key > (hid)=(9014) already exists. > 2017-03-24 16:21:51.169 CET [15008] replica ERROR: duplicate key > value violates unique constraint "pgbench_history_pkey" > 2017-03-24 16:21:51.170 CET [27502] replica LOG: worker process: > logical replication worker for subscription 24864 (PID 15008) exited > with exit code 1 > 2017-03-24 16:21:51.170 CET [27521] replica LOG: starting logical > replication worker for subscription "sub1" > [...] > > My primary and replica were always on a single machine (making it more > likely that that timeout is reached?). In my testing it seems that > reaching the timeout on the primary (and 'closing the connection > unexpectedly' on the replica) does not necessarily break the logical > replication. But almost all log-rep failures that I have seen were > started by this sequence of events. > > After setting wal_sender_timeout to 3 minutes there were no more > failed tests. > > Perhaps it warrants setting wal_sender_timeout a bit higher than the > current default of 60 seconds? After all I also saw the 'replication > timeout' / 'closed the connection' couple rather often during > not-failing tests. (These also disappeared, almost completely, with a > higher setting of wal_sender_timeout) > > In any case it would be good to mention the setting (and its potentially > deteriorating effect) somehere nearer the logical replication treatment. > > ( I read about wal_sender_timeout and keepalive ping, perhaps there's > (still) something amiss there? Just a guess, I don't know ) > > As I said, I saw no more failures with the higher 3 minute setting, with > one exception: the one test that straddled the DST change (saterday 24 > march 02:00 h). I am happy to discount that one failure but strictly > speaking I suppose it should be able to take DST into its stride. > > Hi, I think what you have seen is because of this: https://www.postgresql.org/message-id/flat/b235fa69-147a-5e09-f8f3-3f780a1abb2e@2ndquadrant.com#b235fa69-147a-5e09-f8f3-3f780a1abb2e@2ndquadrant.com -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
pgsql-hackers by date: