On 2017-02-11 11:16, Erik Rijkers wrote:
> On 2017-02-08 23:25, Petr Jelinek wrote:
>
>> 0001-Use-asynchronous-connect-API-in-libpqwalreceiver-v2.patch
>> 0002-Always-initialize-stringinfo-buffers-in-walsender-v2.patch
>> 0003-Fix-after-trigger-execution-in-logical-replication-v2.patch
>> 0004-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION-v2.patch
>> 0001-Logical-replication-support-for-initial-data-copy-v4.patch
>
> This often works but it also fails far too often (in my hands). I
> test whether the tables are identical by comparing an md5 from an
> ordered resultset, from both replica and master. I estimate that 1 in
> 5 tries fail; 'fail' being a somewhat different table on replica
> (compared to mater), most often pgbench_accounts (typically there are
> 10-30 differing rows). No errors or warnings in either logfile. I'm
> not sure but I think testing on faster machines seem to be doing
> somewhat better ('better' being less replication error).
>
I have noticed that when I insert a few seconds wait-state after the
create subscription (or actually: the 'enable'ing of the subscription)
the problem does not occur. Apparently, (I assume) the initial snapshot
occurs somewhere when the subsequent pgbench-run has already started, so
that the logical replication also starts somewhere 'into' that
pgbench-run. Does that make sense?
I don't know what to make of it. Now that I think that I understand
what happens I hesitate to call it a bug. But I'd say it's still a
useability problem that the subscription is only 'valid' after some
time, even if it's only a few seconds.
(the other problem I mentioned (drop subscription hangs) still happens
every now and then)