Re: [HACKERS] Logical replication existing data copy - Mailing list pgsql-hackers

From Petr Jelinek
Subject Re: [HACKERS] Logical replication existing data copy
Date
Msg-id 16135dcb-0b52-2347-2173-9fb2cfeef7ad@2ndquadrant.com
In response to Re: [HACKERS] Logical replication existing data copy  (Erik Rijkers <er@xs4all.nl>)
Responses Re: [HACKERS] Logical replication existing data copy  (Erik Rijkers <er@xs4all.nl>)
List pgsql-hackers
On 13/02/17 14:51, Erik Rijkers wrote:
> On 2017-02-11 11:16, Erik Rijkers wrote:
>> On 2017-02-08 23:25, Petr Jelinek wrote:
>>
>>> 0001-Use-asynchronous-connect-API-in-libpqwalreceiver-v2.patch
>>> 0002-Always-initialize-stringinfo-buffers-in-walsender-v2.patch
>>> 0003-Fix-after-trigger-execution-in-logical-replication-v2.patch
>>> 0004-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION-v2.patch
>>> 0001-Logical-replication-support-for-initial-data-copy-v4.patch
>>
>> This often works but it also fails far too often (in my hands).  I
>> test whether the tables are identical by comparing an md5 from an
>> ordered resultset, from both replica and master.  I estimate that 1 in
>> 5 tries fails; 'fail' being a somewhat different table on replica
>> (compared to master), most often pgbench_accounts (typically there are
>> 10-30 differing rows).  No errors or warnings in either logfile.  I'm
>> not sure but I think testing on faster machines seems to be doing
>> somewhat better ('better' being fewer replication errors).
>>
> 
> I have noticed that when I insert a few seconds wait-state after the
> create subscription (or actually: the 'enable'ing of the subscription)
> the problem does not occur.  Apparently, (I assume) the initial snapshot
> occurs somewhere when the subsequent pgbench-run has already started, so
> that the logical replication also starts somewhere 'into' that
> pgbench-run. Does that make sense?
> 
> I don't know what to make of it.  Now that I think that I understand
> what happens I hesitate to call it a bug. But I'd say it's still a
> usability problem that the subscription is only 'valid' after some
> time, even if it's only a few seconds.
> 

It is a bug. We go to great lengths to create a data snapshot that
corresponds to a specific LSN, so that we can decode exactly the changes
that happened after the data snapshot was taken. tablecopy.c also does
quite a lot to synchronize the table handover to the main apply process,
so that the data stream continues correctly from that point. The end
result is that concurrent changes are supposed to be okay: replication
should eventually catch up and the contents should be the same.
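
To make the snapshot/LSN argument concrete, here is a toy model in Python.
It is only an illustration and not PostgreSQL code: the function names, the
integer LSNs and the key/value "table" are all hypothetical. It shows that
copying a snapshot taken at a given LSN and then applying exactly the changes
after that LSN reproduces the publisher, while starting the stream a little
too late leaves a few differing rows, which is roughly the symptom reported.

```python
# Toy model only: names, integer LSNs and the dict "table" are hypothetical.

# Publisher history as (lsn, key, new_value), ordered by LSN.
changes = [(1, "a", 10), (2, "b", 20), (3, "a", 11), (4, "c", 30)]


def publisher_state(upto_lsn):
    """Publisher contents after all changes with LSN <= upto_lsn."""
    state = {}
    for lsn, key, value in changes:
        if lsn <= upto_lsn:
            state[key] = value
    return state


def copy_then_apply(snapshot, start_lsn):
    """Copy the snapshot, then apply every change with LSN > start_lsn."""
    table = dict(snapshot)
    for lsn, key, value in changes:
        if lsn > start_lsn:
            table[key] = value
    return table


snapshot_lsn = 2
snapshot = publisher_state(snapshot_lsn)
final = publisher_state(4)

# Streaming starts exactly at the snapshot LSN: contents converge.
good = copy_then_apply(snapshot, start_lsn=snapshot_lsn)

# Streaming starts one change too late: the update to "a" is lost.
bad = copy_then_apply(snapshot, start_lsn=3)

print(good == final, bad == final)
```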

That being said, I have so far been unable to reproduce this on my test
machine(s), so I have no idea yet what causes it.

Could you periodically dump the contents of pg_subscription_rel on the
subscriber (ideally at the same time as you dump the md5 of the data) and
attach that as well?
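
Something along these lines would do, as a rough sketch. The two query
strings and the row-to-text encoding are assumptions for illustration (they
are not the script used in the test above), and the rows would come from
whatever client or driver is at hand; here they are plain tuples so that the
snippet runs on its own.

```python
import hashlib

# Hypothetical queries: run DATA_QUERY on both master and replica, and
# STATE_QUERY on the subscriber at the same moment.
DATA_QUERY = "SELECT * FROM pgbench_accounts ORDER BY aid"
STATE_QUERY = "SELECT * FROM pg_subscription_rel"


def md5_of_rows(rows):
    """Return the md5 hex digest of an ordered result set."""
    digest = hashlib.md5()
    for row in rows:
        # Tab-separated text, with NULL written as \N.
        line = "\t".join("\\N" if v is None else str(v) for v in row)
        digest.update(line.encode("utf-8") + b"\n")
    return digest.hexdigest()


# Stand-in rows (aid, bid, abalance, filler) instead of a live connection.
master_rows = [(1, 1, 0, ""), (2, 1, 0, "")]
replica_rows = [(1, 1, 0, ""), (2, 1, 5, "")]

print(md5_of_rows(master_rows) == md5_of_rows(replica_rows))
```

Logging the pg_subscription_rel rows next to each md5 comparison should show
which sync state each table was in when the contents differed.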

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services