Re: [HACKERS] Logical replication existing data copy - Mailing list pgsql-hackers

From Erik Rijkers
Subject Re: [HACKERS] Logical replication existing data copy
Msg-id 0a4418aff31920c92c1a446ad20d89f3@xs4all.nl
In response to Re: [HACKERS] Logical replication existing data copy  (Petr Jelinek <petr.jelinek@2ndquadrant.com>)
Responses Re: [HACKERS] Logical replication existing data copy  (Petr Jelinek <petr.jelinek@2ndquadrant.com>)
List pgsql-hackers
On 2017-02-25 00:40, Petr Jelinek wrote:

> 0001-Use-asynchronous-connect-API-in-libpqwalreceiver.patch
> 0002-Fix-after-trigger-execution-in-logical-replication.patch
> 0003-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION.patch
> snapbuild-v3-0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch
> snapbuild-v3-0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch
> snapbuild-v3-0003-Fix-xl_running_xacts-usage-in-snapshot-builder.patch
> snapbuild-v3-0004-Skip-unnecessary-snapshot-builds.patch
> 0001-Logical-replication-support-for-initial-data-copy-v6.patch

Here are some results. There is improvement although it's not an 
unqualified success.

Several repeat runs of pgbench_derail2.sh, with different values for the 
number of clients, each yielded an output file.

Those show that logrep is now pretty stable when there is only 1 client 
(pgbench -c 1).  But it starts making mistakes with 4, 8, or 16 clients.  
I'll just show a grep of the output files; I think it is 
self-explanatory:

Output-files (lines counted with  grep | sort | uniq -c):

-- out_20170225_0129.txt
    250 -- pgbench -c 1 -j 8 -T 10 -P 5 -n
    250 -- All is well.

-- out_20170225_0654.txt
     25 -- pgbench -c 4 -j 8 -T 10 -P 5 -n
     24 -- All is well.
      1 -- Not good, but breakingout of wait (waited more than 60s)

-- out_20170225_0711.txt
     25 -- pgbench -c 8 -j 8 -T 10 -P 5 -n
     23 -- All is well.
      2 -- Not good, but breakingout of wait (waited more than 60s)

-- out_20170225_0803.txt
     25 -- pgbench -c 16 -j 8 -T 10 -P 5 -n
     11 -- All is well.
     14 -- Not good, but breakingout of wait (waited more than 60s)

So, that says:
1 client: 250x success, zero fail (250 is not a typo; I ran this overnight)
4 clients: 24x success, 1 fail
8 clients: 23x success, 2 fail
16 clients: 11x success, 14 fail
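
To put those counts on a common scale, here is a small Python sketch that 
turns them into failure rates. It is not part of pgbench_derail2.sh; the 
names (results, failure_rate, rates) are made up for illustration, and the 
numbers are simply the ones listed above.

```python
# Counts copied from the summary above: (clients, successes, failures).
results = [(1, 250, 0), (4, 24, 1), (8, 23, 2), (16, 11, 14)]


def failure_rate(successes, failures):
    """Return the fraction of runs that failed (0.0 if there were no runs)."""
    total = successes + failures
    return failures / total if total else 0.0


# Map each client count to its failure rate.
rates = {clients: failure_rate(ok, bad) for clients, ok, bad in results}

for clients, rate in rates.items():
    print(f"{clients:>2} clients: {rate:.0%} of runs failed")
```

With only 25 runs each for the 4, 8 and 16 client cases, the rates are 
rough, but the trend with the number of clients is clear.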

I want to repeat what I said a few emails back: problems seem to 
disappear when a short wait state is introduced (directly after the 
'alter subscription sub1 enable' line) to give the logrep machinery time 
to 'settle'. It makes one think of a timing error somewhere (now don't 
ask me where..).

To show that, here is pgbench_derail2.sh output from a run that waited 
10 seconds (INIT_WAIT in the script) as such a 'settle' period. With that 
wait it works faultlessly (with 16 clients):

-- out_20170225_0852.txt
     25 -- pgbench -c 16 -j 8 -T 10 -P 5 -n
     25 -- All is well.

QED.
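
For readers without the script at hand, the wait logic described here can 
be sketched roughly as follows. This is a Python illustration only, not the 
actual pgbench_derail2.sh code: wait_for and check are hypothetical names, 
and check stands in for whatever comparison the script makes between 
primary and replica. The 60-second timeout and the optional initial 
'settle' sleep mirror the "waited more than 60s" message and INIT_WAIT.

```python
import time


def wait_for(check, timeout=60.0, interval=1.0, init_wait=0.0,
             sleep=time.sleep, clock=time.monotonic):
    """Sleep init_wait seconds, then poll check() until it returns True
    or until timeout seconds have passed. Returns True on success."""
    if init_wait > 0:
        sleep(init_wait)       # the 'settle' period before any checking
    deadline = clock() + timeout
    while True:
        if check():
            return True        # would be reported as "All is well."
        if clock() >= deadline:
            return False       # would be reported as breaking out of the wait
        sleep(interval)
```

A call such as wait_for(replica_matches, init_wait=10), where 
replica_matches is again a hypothetical function, would correspond to the 
INIT_WAIT=10 run shown above.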

(By the way, no hung sessions so far, so that's good)


thanks

Erik Rijkers


