Re: [HACKERS] logical replication - still unstable after all these months - Mailing list pgsql-hackers

From Erik Rijkers
Subject Re: [HACKERS] logical replication - still unstable after all these months
Date
Msg-id 9d592410c042fedfa6dc10e19adf7180@xs4all.nl
In response to Re: [HACKERS] logical replication - still unstable after all these months  (Mark Kirkwood <mark.kirkwood@catalyst.net.nz>)
List pgsql-hackers
On 2017-05-27 01:35, Mark Kirkwood wrote:
> On 26/05/17 20:09, Erik Rijkers wrote:
>> 
>> this whole thing 100x
> 
> Some questions that might help me get it right:
> - do you think we need to stop and start the instances every time?
> - do we need to init pgbench each time?
> - could we just drop the subscription and publication and truncate the 
> replica tables instead?

I have done all that in earlier versions.

I deliberately added these 'complications' in view of the intractability 
of the problem: my fear is that an earlier failure leaves some 
half-failed state behind in an instance, which then might cause more 
failure.  This would undermine the intent of the whole exercise (which 
is to count the success/failure rate).  So it is important to be as sure as 
possible that each cycle starts out as cleanly as possible.

> - what scale pgbench are you running?

I use a small script to call the main script; at the moment it does 
something like:
-------------------
duration=60
from=1
to=100
for scale in 25 5
do
  for clients in 90 64 8
  do
    date_str=$(date +"%Y%m%d_%H%M")
    outfile=out_${date_str}.txt
    time for x in `seq $from $to`
    do
        ./pgbench_derail2.sh $scale $clients $duration $date_str
[...]
-------------------

> - how many clients for the 1 min pgbench run?

see above

> - are you starting the pgbench run while the copy_data jobs for the 
> subscription are still running?

I assume by copy_data you mean the data sync of the original table 
before pgbench starts.
And yes, I think this might be where the problem originates.
(I think the problem I get is actually easily avoided by putting wait 
states here and there between the separate steps.  But the idea of this 
test is to force the system into error, not to avoid errors.)
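
A wait state of the kind mentioned above might look like the following 
sketch.  It is not taken from the test script: the function name 
wait_for_sync, its arguments and the one-second polling interval are 
assumptions for illustration.  It polls pg_subscription_rel on the 
subscriber until no table is left in a state other than 'r' (ready), or 
gives up after a timeout.

```shell
# Sketch only: wait_for_sync is a hypothetical helper, not part of the
# original test script.
# Polls the subscriber until every subscribed table has srsubstate 'r'.
wait_for_sync()
{
  port=$1            # subscriber port (for example $port2)
  timeout=${2:-60}   # give up after this many seconds
  waited=0
  while true
  do
    # number of tables whose initial sync has not reached 'ready' yet
    pending=$(echo "select count(*) from pg_subscription_rel where srsubstate <> 'r'" | psql -qtAXp "$port")
    if [ "$pending" = "0" ]
    then
      return 0
    fi
    if [ "$waited" -ge "$timeout" ]
    then
      echo "initial sync not finished after $timeout s (pending: $pending)" >&2
      return 1
    fi
    sleep 1
    waited=$(( waited + 1 ))
  done
}
```

Calling something like `wait_for_sync $port2 120 || exit 1` before 
starting pgbench would presumably hide the failure, which is exactly why 
the test does not do it.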

> - how exactly are you calculating those md5's?

Here is the bash function: cb (I forget what that stands for, I guess 
'content bench').  $outf is a log file to which the program writes 
output:

---------------------------
function cb()
{
  #  display the 4 pgbench tables' accumulated content as md5s
  #  a,b,t,h stand for:  pgbench_accounts, -branches, -tellers, -history
  num_tables=$( echo "select count(*) from pg_class where relkind = 'r' and relname ~ '^pgbench_'" | psql -qtAX )
  if [[ $num_tables -ne 4 ]]
  then
     echo "pgbench tables not 4 - exit" >> $outf
     exit
  fi
  for port in $port1 $port2
  do
    md5_a=$(echo "select * from pgbench_accounts order by aid"|psql -qtAXp $port|md5sum|cut -b 1-9)
    md5_b=$(echo "select * from pgbench_branches order by bid"|psql -qtAXp $port|md5sum|cut -b 1-9)
    md5_t=$(echo "select * from pgbench_tellers  order by tid"|psql -qtAXp $port|md5sum|cut -b 1-9)
    md5_h=$(echo "select * from pgbench_history  order by hid"|psql -qtAXp $port|md5sum|cut -b 1-9)
    cnt_a=$(echo "select count(*) from pgbench_accounts"      |psql -qtAXp $port)
    cnt_b=$(echo "select count(*) from pgbench_branches"      |psql -qtAXp $port)
    cnt_t=$(echo "select count(*) from pgbench_tellers"       |psql -qtAXp $port)
    cnt_h=$(echo "select count(*) from pgbench_history"       |psql -qtAXp $port)
    md5_total[$port]=$( echo "${md5_a} ${md5_b} ${md5_t} ${md5_h}" | md5sum )
    printf "$port a,b,t,h: %8d %6d %6d %6d" $cnt_a $cnt_b $cnt_t $cnt_h
    echo -n "  $md5_a $md5_b $md5_t $md5_h"
    if   [[ $port -eq $port1 ]]; then echo    " master"
    elif [[ $port -eq $port2 ]]; then echo -n " replica"
    else                              echo    "           ERROR  "
    fi
  done
  if [[ "${md5_total[$port1]}" == "${md5_total[$port2]}" ]]
  then
    echo " ok"
  else
    echo " NOK"
  fi
}
---------------------------

this enables:

echo "-- getting md5 (cb)"
cb_text1=$(cb)

and testing that string like:
    if echo "$cb_text1" | grep -qw 'replica ok';
    then
       echo "-- All is well."

[...]
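
Since the point of the exercise is to count the success/failure rate, 
the per-run results have to be tallied at some stage.  A minimal sketch 
of one way to do that, assuming the cb output of every run has been 
appended to a single file; the function name tally_results and that file 
layout are hypothetical and not part of the script shown here:

```shell
# Sketch only: tally_results and the results file are hypothetical.
# Counts how many cb summaries ended in 'replica ok' versus 'replica NOK'.
tally_results()
{
  file=$1
  if [ ! -r "$file" ]
  then
    echo "cannot read $file" >&2
    return 1
  fi
  # grep -c prints 0 but exits non-zero when nothing matches, hence || true
  ok=$(grep -cw 'replica ok' "$file" || true)
  nok=$(grep -cw 'replica NOK' "$file" || true)
  echo "ok: $ok  NOK: $nok  total: $(( ok + nok ))"
}
```

It relies on the same 'replica ok' / 'replica NOK' line endings that the 
grep test above matches.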


Later today I'll try to clean up the whole thing and post it.














