Thread: [HACKERS] logical replication - still unstable after all these months

[HACKERS] logical replication - still unstable after all these months

From
Erik Rijkers
Date:
If you run a pgbench session of 1 minute over a logical replication 
connection and repeat that 100x this is what you get:

At clients 90, 64, 8, scale 25:

-- out_20170525_0944.txt
    100 -- pgbench -c 90 -j 8 -T 60 -P 12 -n   --  scale 25
     93 -- All is well.
      7 -- Not good.

-- out_20170525_1426.txt
    100 -- pgbench -c 64 -j 8 -T 60 -P 12 -n   --  scale 25
     82 -- All is well.
     18 -- Not good.

-- out_20170525_2049.txt
    100 -- pgbench -c 8 -j 8 -T 60 -P 12 -n   --  scale 25
     90 -- All is well.
     10 -- Not good.


At clients 90, 64, 8, scale 5:

-- out_20170526_0126.txt
    100 -- pgbench -c 90 -j 8 -T 60 -P 12 -n   --  scale 5
     98 -- All is well.
      2 -- Not good.

-- out_20170526_0352.txt
    100 -- pgbench -c 64 -j 8 -T 60 -P 12 -n   --  scale 5
     97 -- All is well.
      3 -- Not good.

-- out_20170526_0621.txt
     45 -- pgbench -c 8 -j 8 -T 60 -P 12 -n   --  scale 5
     41 -- All is well.
      3 -- Not good.
        (That last one obviously had not finished.)


I think this is pretty awful, really, for a beta level.

The above installations (master+replica) are with Petr Jelinek's (and 
Michael Paquier's) last patches:
  0001-Fix-signal-handling-in-logical-workers.patch
  0002-Make-tablesync-worker-exit-when-apply-dies-while-it-.patch
  0003-Receive-invalidation-messages-correctly-in-tablesync.patch
  Remove-the-SKIP-REFRESH-syntax-suggar-in-ALTER-SUBSC-v2.patch

Now, it could be that there is somehow something wrong with my 
test-setup (as opposed to some bug in log-repl).  I can post my test 
program, but I'll do that separately (below is the core of all my tests 
-- it's basically still that very first test that I started out with, 
many months ago...)


I'd like to find out/know more about:
- Do you agree this number of failures is far too high?
- Am I the only one finding so many failures?
- Is anyone else testing the same way (more or less continually, finding 
only success)?
- Which of the Open Items could be responsible for this failure rate?  (I 
don't see a match.)
- What tests do others do?  Could we somehow concentrate results and 
method somewhere?


Thanks,


Erik Rijkers




PS

The core of the 'pgbench_derail' test (bash) is simply:

echo "drop table if exists pgbench_accounts;
drop table if exists pgbench_branches;
drop table if exists pgbench_tellers;
drop table if exists pgbench_history;" | psql -qXp $port1 \
&& echo "drop table if exists pgbench_accounts;
drop table if exists pgbench_branches;
drop table if exists pgbench_tellers;
drop table if exists pgbench_history;" | psql -qXp $port2 \
&& pgbench -p $port1 -qis $scale \
&& echo "alter table pgbench_history add column hid serial primary key;" \
   | psql -q1Xp $port1 \
&& pg_dump -F c -p $port1 \
   --exclude-table-data=pgbench_history \
   --exclude-table-data=pgbench_accounts \
   --exclude-table-data=pgbench_branches \
   --exclude-table-data=pgbench_tellers \
   -t pgbench_history -t pgbench_accounts \
   -t pgbench_branches -t pgbench_tellers \
   | pg_restore -1 -p $port2 -d testdb
 
appname=derail2
echo "create publication pub1 for all tables;" | psql -p $port1 -aqtAX
echo "create subscription sub1 connection 'port=${port1}  application_name=$appname' publication pub1
with(enabled=false);
alter subscription sub1 enable;" | psql -p $port2 -aqtAX

pgbench -c $clients -j $threads -T $duration -P $pseconds -n    #  scale $scale

Now compare md5's of the sorted content of each of the 4 pgbench tables 
on primary and replica.  They should be the same.
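
A minimal sketch of that comparison (just the idea; the full cb() function 
the test actually uses is posted further down this thread):

------------------------------------------------------------------
# compare the sorted content of the 4 pgbench tables on primary and replica
declare -A key=( [pgbench_accounts]=aid [pgbench_branches]=bid
                 [pgbench_tellers]=tid  [pgbench_history]=hid )
for t in "${!key[@]}"
do
  m1=$(echo "select * from $t order by ${key[$t]}" | psql -qtAXp $port1 | md5sum)
  m2=$(echo "select * from $t order by ${key[$t]}" | psql -qtAXp $port2 | md5sum)
  [[ "$m1" == "$m2" ]] && echo "$t ok" || echo "$t NOK"
done
------------------------------------------------------------------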




On 26 May 2017 at 07:10, Erik Rijkers <er@xs4all.nl> wrote:

> - Do you agree this number of failures is far too high?
> - Am I the only one finding so many failures?

What type of failure are you getting?

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] logical replication - still unstable after all these months

From
Erik Rijkers
Date:
On 2017-05-26 08:58, Simon Riggs wrote:
> On 26 May 2017 at 07:10, Erik Rijkers <er@xs4all.nl> wrote:
> 
>> - Do you agree this number of failures is far too high?
>> - Am I the only one finding so many failures?
> 
> What type of failure are you getting?

The failure is that in the result state the replicated tables differ 
from the original tables.

For instance,

-- out_20170525_0944.txt
    100 -- pgbench -c 90 -j 8 -T 60 -P 12 -n   --  scale 25
     93 -- All is well.
      7 -- Not good.
 

These numbers mean: the result state of primary and replica is not the 
same, in 7 out of 100 runs.

'not the same state' means:  at least one of the 4 md5's of the sorted 
content of the 4 pgbench tables on the primary is different from those 
taken from the replica.

So, 'failure' means: the 4 pgbench tables on primary and replica are not 
exactly the same after the (one-minute) pgbench-run has finished, and 
logical replication has 'finished'.  (Plenty of time is given for the 
replica to catch up: the test only calls 'failure' after 20x waiting (for 
15 seconds) and 20x finding the same erroneous state (erroneous because 
not the same as on primary).)
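
In outline, that retry logic is something like this (a sketch; cb is the 
md5-comparison function posted further down the thread):

------------------------------------------------------------------
# only declare failure after 20 rounds of waiting 15s and re-checking
tries=0
while (( tries < 20 ))
do
  if cb | grep -qw 'replica ok'   # cb prints the per-port md5 summary
  then
    echo "-- All is well."
    break
  fi
  sleep 15
  tries=$(( tries + 1 ))
done
(( tries >= 20 )) && echo "-- Not good."
------------------------------------------------------------------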


I would really like to know if you think that this doesn't amount to 
'failure'.




On 26 May 2017 at 08:27, Erik Rijkers <er@xs4all.nl> wrote:
> On 2017-05-26 08:58, Simon Riggs wrote:
>>
>> On 26 May 2017 at 07:10, Erik Rijkers <er@xs4all.nl> wrote:
>>
>>> - Do you agree this number of failures is far too high?
>>> - Am I the only one finding so many failures?
>>
>>
>> What type of failure are you getting?
>
>
> The failure is that in the result state the replicated tables differ from
> the original tables.

An important point to note is that this is time-dependent.

> I would really like to know it you think that that doesn't amount to
> 'failure'.

Yes, your test has failed.

Even one record on one test is a serious problem and needs to be fixed.

If we can find out what the bug is with a repeatable test case we can fix it.

Could you provide more details? Thanks

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] logical replication - still unstable after all these months

From
Erik Rijkers
Date:
On 2017-05-26 09:40, Simon Riggs wrote:
> 
> If we can find out what the bug is with a repeatable test case we can 
> fix it.
> 
> Could you provide more details? Thanks

I will, just need some time to clean things up a bit.


But what I would like is for someone else to repeat my 100x1-minute 
tests, taking as core that snippet I posted in my previous email.  I 
built bash-stuff around that core (to take md5's, shut-down/start-up the 
two instances between runs, write info to log-files, etc).  But it would 
be good if someone else made that separately because if that then does 
not fail, it would prove that my test-harness is at fault (and not 
logical replication).

The idea is simple enough:

startup instance1
startup instance2 (on same machine)
primary: init pgbench tables
primary: add primary key to pgbench_history
copy empty tables to replica by dump/restore
primary: start publication
replica: start subscription
primary: run 1-minute pgbench
wait till the 4 md5's of primary pgbench tables
  are the same as the 4 md5's of replica pgbench
  tables (this will need a time-out).
log 'ok' or 'not ok'
primary: clean up publication
replica: clean up subscription
shutdown primary
shutdown replica

this whole thing 100x
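
As a bash skeleton (a sketch only; the real scripts, which also handle 
start/stop of both instances and the logging, are attached further down 
the thread under the same names):

-------------------
# one full cycle per iteration, repeated 100x from a clean start
for x in $(seq 1 100)
do
  ./instances.sh                                            # start primary + replica
  ./pgbench_derail2.sh $scale $clients $duration $date_str  # init, publish/subscribe, pgbench, compare md5's
  # ...shut both instances down again before the next cycle...
done
-------------------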






Re: [HACKERS] logical replication - still unstable after all these months

From
Mark Kirkwood
Date:
On 26/05/17 20:09, Erik Rijkers wrote:

> On 2017-05-26 09:40, Simon Riggs wrote:
>>
>> If we can find out what the bug is with a repeatable test case we can 
>> fix it.
>>
>> Could you provide more details? Thanks
>
> I will, just need some time to clean things up a bit.
>
>
> But what I would like is for someone else to repeat my 100x1-minute 
> tests, taking as core that snippet I posted in my previous email.  I 
> built bash-stuff around that core (to take md5's, shut-down/start-up 
> the two instances between runs, write info to log-files, etc).  But it 
> would be good if someone else made that separately because if that 
> then does not fail, it would prove that my test-harness is at fault 
> (and not logical replication).
>

Will do - what I had been doing was running pgbench, waiting until the 
row counts on the replica pgbench_history were the same as the primary, 
then summing the %balance and delta fields from the primary and replica 
dbs and comparing. So far - all match up ok. However I'd been running 
longer time frames (5 minutes), so not the same number of repetitions as 
yet.
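
For reference, that check can be as small as something like this (a 
sketch using the standard pgbench column names, and Erik's $port1/$port2 
convention for the two instances):

-------------------
# compare aggregate balances and deltas on primary vs replica
for port in $port1 $port2
do
  echo "select sum(abalance) from pgbench_accounts;
        select sum(tbalance) from pgbench_tellers;
        select sum(bbalance) from pgbench_branches;
        select sum(delta)    from pgbench_history;" | psql -qtAXp $port
done
-------------------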

regards

Mark



Re: [HACKERS] logical replication - still unstable after all these months

From
Erik Rijkers
Date:
On 2017-05-26 10:29, Mark Kirkwood wrote:
> On 26/05/17 20:09, Erik Rijkers wrote:
> 
>> On 2017-05-26 09:40, Simon Riggs wrote:
>>> 
>>> If we can find out what the bug is with a repeatable test case we can 
>>> fix it.
>>> 
>>> Could you provide more details? Thanks
>> 
>> I will, just need some time to clean things up a bit.
>> 
>> 
>> But what I would like is for someone else to repeat my 100x1-minute 
>> tests, taking as core that snippet I posted in my previous email.  I 
>> built bash-stuff around that core (to take md5's, shut-down/start-up 
>> the two instances between runs, write info to log-files, etc).  But it 
>> would be good if someone else made that separately because if that 
>> then does not fail, it would prove that my test-harness is at fault 
>> (and not logical replication).
>> 
> 
> Will do - what I had been doing was running pgbench, waiting until the

Great!

You'll have to think about whether to go with instances of either 
master, or master+those 4 patches.  I guess either choice makes sense.

> row counts on the replica pgbench_history were the same as the
> primary, then summing the %balance and delta fields from the primary
> and replica dbs and comparing. So far - all match up ok. However I'd

I did number-summing for a while as well (because it's a lot faster than 
taking md5's over the full content).
But the problem with summing is that (I think) in the end you cannot be 
really sure that the result is correct (false positives are possible, 
although I haven't worked out the odds).

> been running a longer time frames (5 minutes), so not the same number
> of repetitions as yet.

I've run 3600-, 30- and 15-minute runs too, but in this case (these 100x 
tests) I wanted to especially test the area around startup/initialise of 
logical replication.  Also the increasing quality of logical replication 
(once it runs with the correct

thanks,

Erik Rijkers



On 05/26/2017 12:57 PM, Erik Rijkers wrote:
> The failure is that in the result state the replicated tables differ 
> from the original tables. 
I am also getting similar behavior

Master=
run pgbench with scaling factor = 1  (./pgbench -i -s 1 postgres)
delete rows from pgbench_history  ( delete from pgbench_history)
create publication  (create publication pub for table pgbench_history)

Slave=
run pgbench with scaling factor = 1  (./pgbench -i -s 1 postgres -p 5000)
delete rows from pgbench_history  ( delete from pgbench_history)
create subscription (create subscription sub connection 'dbname=postgres 
host=localhost user=centos' publication pub;)

create a test.sql file , having an insert statement
[centos@centos-cpula bin]$ cat test.sql
insert into pgbench_history values (1,1,1,1,now(),'anv');

now run pgbench with -T / -c / -j options
First time = ./pgbench -t 5   -c 90 -j 90  -f test.sql  postgres

counts on Master/slave are the SAME.

run second time =
./pgbench -T 20   -c 90 -j 90  -f test.sql  postgres

check the row count on master/standby
Master=
postgres=# select count(*) from pgbench_history ;
 count
--------
 536836
(1 row)

Standby =

postgres=#  select count(*) from pgbench_history ;
  count
---------
 1090959
(1 row)

-- 
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company




Re: [HACKERS] logical replication - still unstable after all these months

From
Petr Jelinek
Date:
Hi,

Hmm, I was under the impression that the changes we proposed in the
snapbuild thread fixed your issues, does this mean they didn't? Or the
modified versions of those that were eventually committed didn't? Or did
issues reappear at some point?

-- 
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] logical replication - still unstable after all these months

From
Erik Rijkers
Date:
On 2017-05-26 15:59, Petr Jelinek wrote:
> Hi,
> 
> Hmm, I was under the impression that the changes we proposed in the
> snapbuild thread fixed your issues, does this mean they didn't? Or the
> modified versions of those that were eventually committed didn't? Or 
> did
> issues reappear at some point?

I do think the snapbuild fixes solved certain problems.  I can't say 
what causes the present problems (as I have said, I suspect logical 
replication, but also my own test-harness: perhaps it leaves some 
error-state lying around, although I do try hard to prevent that) -- so 
I just don't know.

I wouldn't say that problems (re)appeared at a certain point; my 
impression is rather that logical replication has become better and 
better.  But I kept getting the odd failure, without a clear cause, but 
always (eventually) repeatable on other machines.  I did the 1-minute 
pgbench-derail version exactly because of the earlier problems with 
snapbuild: I wanted a test that does a lot of starting and stopping of 
publication and subscription.


Erik Rijkers







Re: [HACKERS] logical replication - still unstable after all these months

From
Alvaro Herrera
Date:
Erik Rijkers wrote:

> I wouldn't say that problems (re)appeared at a certain point; my impression
> is rather that logical replication has become better and better.  But I kept
> getting the odd failure, without a clear cause, but always (eventually)
> repeatable on other machines.  I did the 1-minute pgbench-derail version
> exactly because of the earlier problems with snapbuild: I wanted a test that
> does a lot of starting and stopping of publication and subscription.

I think it is pretty unlikely that the logical replication plumbing is
the buggy place.  You're just seeing it now because we didn't have any
mechanism as convenient to consume logical decoding output.  In other
words, I strongly suspect that the hypothetical bugs are in the logical
decoding side (and snapbuild sounds the most promising candidate) rather
than logical replication per se.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] logical replication - still unstable after all these months

From
Petr Jelinek
Date:
On 26/05/17 16:51, Alvaro Herrera wrote:
> Erik Rijkers wrote:
> 
>> I wouldn't say that problems (re)appeared at a certain point; my impression
>> is rather that logical replication has become better and better.  But I kept
>> getting the odd failure, without a clear cause, but always (eventually)
>> repeatable on other machines.  I did the 1-minute pgbench-derail version
>> exactly because of the earlier problems with snapbuild: I wanted a test that
>> does a lot of starting and stopping of publication and subscription.
> 
> I think it is pretty unlikely that the logical replication plumbing is
> the buggy place.  You're just seeing it now becaues we didn't have any
> mechanism as convenient to consume logical decoding output.  In other
> words, I strongly suspect that the hyphothetical bugs are in the logical
> decoding side (and snapbuild sounds the most promising candidate) rather
> than logical replication per se.
> 

Well, that was true for the previous issues Erik found as well (mostly
snapshot builder was problematic). But that does not mean there are no
issues elsewhere. We could do with some more output from the tests (do
they log some intermediary states of those md5 checksums, maybe numbers
of rows etc?), description of the problems, errors from logs, etc.  I for
example don't get any issues from a similar test to the one described in
this thread, so without more info it might be hard to reproduce and fix
whatever the underlying issue is.

-- 
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services



On Fri, May 26, 2017 at 5:17 AM, tushar <tushar.ahuja@enterprisedb.com> wrote:

run second time =
./pgbench -T 20   -c 90 -j 90  -f test.sql  postgres

check the row count on master/standby
Master=
postgres=# select count(*) from pgbench_history ;
 count
--------
 536836
(1 row)

Standby =

postgres=#  select count(*) from pgbench_history ;
  count
---------
 1090959
(1 row)

Hi Tushar,

pgbench starts out by truncating pgbench_history.  That truncation does not get replicated to the subscriber.  The later inserts do.  So your subscriber ends up with about twice as many rows.
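
(If I remember pgbench correctly, that truncate comes from its implicit 
pre-run vacuum step, and -n skips it; so a quick sanity check, reusing 
your test.sql, is to rerun the second case with -n and compare counts 
again:)

-------------------
# -n skips pgbench's pre-run vacuum/truncate, so the publisher's
# pgbench_history is not truncated underneath the subscription
./pgbench -n -T 20 -c 90 -j 90 -f test.sql postgres
-------------------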

Cheers,

Jeff
On Fri, May 26, 2017 at 12:27 AM, Erik Rijkers <er@xs4all.nl> wrote:
On 2017-05-26 08:58, Simon Riggs wrote:
On 26 May 2017 at 07:10, Erik Rijkers <er@xs4all.nl> wrote:

- Do you agree this number of failures is far too high?
- Am I the only one finding so many failures?

What type of failure are you getting?

The failure is that in the result state the replicated tables differ from the original tables.

But what is the actual failure?  Which tables differ?  Do they have the same number of rows? Do they only differ in the *balance column or something else?  Are they transactionally consistent?

I have not been able to replicate the problem.

Cheers,

Jeff

Re: [HACKERS] logical replication - still unstable after all these months

From
Mark Kirkwood
Date:
On 26/05/17 20:09, Erik Rijkers wrote:

>
> The idea is simple enough:
>
> startup instance1
> startup instance2 (on same machine)
> primary: init pgbench tables
> primary: add primary key to pgbench_history
> copy empty tables to replica by dump/restore
> primary: start publication
> replica: start subscription
> primary: run 1-minute pgbench
> wait till the 4 md5's of primary pgbench tables
>   are the same as the 4 md5's of replica pgbench
>   tables (this will need a time-out).
> log 'ok' or 'not ok'
> primary: clean up publication
> replica: clean up subscription
> shutdown primary
> shutdown replica
>
> this whole thing 100

I might have a look at scripting this up (especially if it keeps raining 
here)...

Some questions that might help me get it right:
- do you think we need to stop and start the instances every time?
- do we need to init pgbench each time?
- could we just drop the subscription and publication and truncate the 
replica tables instead?
- what scale pgbench are you running?
- how many clients for the 1 min pgbench run?
- are you starting the pgbench run while the copy_data jobs for the 
subscription are still running?
- how exactly are you calculating those md5's?

Cheers

Mark





Re: [HACKERS] logical replication - still unstable after all these months

From
Erik Rijkers
Date:
On 2017-05-27 01:35, Mark Kirkwood wrote:
> On 26/05/17 20:09, Erik Rijkers wrote:
>> 
>> this whole thing 100x
> 
> Some questions that might help me get it right:
> - do you think we need to stop and start the instances every time?
> - do we need to init pgbench each time?
> - could we just drop the subscription and publication and truncate the 
> replica tables instead?

I have done all that in earlier versions.

I deliberately added these 'complications' in view of the intractability 
of the problem: my fear is that an earlier failure leaves some 
half-failed state behind in an instance, which then might cause more 
failure.  This would undermine the intent of the whole exercise (which 
is to count the success/failure rate).  So it is important to be as sure as 
possible that each cycle starts out as cleanly as possible.

> - what scale pgbench are you running?

I use a small script to call the main script; at the moment it does 
something like:
-------------------
duration=60
from=1
to=100
for scale in 25 5
do
  for clients in 90 64 8
  do
    date_str=$(date +"%Y%m%d_%H%M")
    outfile=out_${date_str}.txt
    time for x in `seq $from $to`
    do
        ./pgbench_derail2.sh $scale $clients $duration $date_str
 
[...]
-------------------

> - how many clients for the 1 min pgbench run?

see above

> - are you starting the pgbench run while the copy_data jobs for the 
> subscription are still running?

I assume that by copy_data you mean the initial data sync of the tables 
before pgbench starts.
And yes, I think here might be the origin of the problem.
( I think the problem I get is actually easily avoided by putting wait 
states here and there in between separate steps.  But the testing idea 
here is to force the system into error, not to avoid any errors)

> - how exactly are you calculating those md5's?

Here is the bash function: cb (I forget what that stands for, I guess 
'content bench').  $outf is a log file to which the program writes 
output:

---------------------------
function cb()
{
  #  display the 4 pgbench tables' accumulated content as md5s
  #  a,b,t,h stand for:  pgbench_accounts, -branches, -tellers, -history
  num_tables=$( echo "select count(*) from pg_class where relkind = 'r'
    and relname ~ '^pgbench_'" | psql -qtAX )
  if [[ $num_tables -ne 4 ]]
  then
     echo "pgbench tables not 4 - exit" >> $outf
     exit
  fi
  for port in $port1 $port2
  do
    md5_a=$(echo "select * from pgbench_accounts order by aid" | psql -qtAXp $port | md5sum | cut -b 1-9)
    md5_b=$(echo "select * from pgbench_branches order by bid" | psql -qtAXp $port | md5sum | cut -b 1-9)
    md5_t=$(echo "select * from pgbench_tellers  order by tid" | psql -qtAXp $port | md5sum | cut -b 1-9)
    md5_h=$(echo "select * from pgbench_history  order by hid" | psql -qtAXp $port | md5sum | cut -b 1-9)
    cnt_a=$(echo "select count(*) from pgbench_accounts" | psql -qtAXp $port)
    cnt_b=$(echo "select count(*) from pgbench_branches" | psql -qtAXp $port)
    cnt_t=$(echo "select count(*) from pgbench_tellers"  | psql -qtAXp $port)
    cnt_h=$(echo "select count(*) from pgbench_history"  | psql -qtAXp $port)
    md5_total[$port]=$( echo "${md5_a} ${md5_b} ${md5_t} ${md5_h}" | md5sum )
    printf "$port a,b,t,h: %8d %6d %6d %6d" $cnt_a $cnt_b $cnt_t $cnt_h
    echo -n "  $md5_a $md5_b $md5_t $md5_h"
    if   [[ $port -eq $port1 ]]; then echo    " master"
    elif [[ $port -eq $port2 ]]; then echo -n " replica"
    else                              echo    "           ERROR  "
    fi
  done
  if [[ "${md5_total[$port1]}" == "${md5_total[$port2]}" ]]
  then
    echo " ok"
  else
    echo " NOK"
  fi
}
---------------------------

this enables:

echo "-- getting md5 (cb)"
cb_text1=$(cb)

and testing that string like:
    if echo "$cb_text1" | grep -qw 'replica ok'
    then
       echo "-- All is well."

[...]


Later today I'll try to clean up the whole thing and post it.















Re: [HACKERS] logical replication - still unstable after all these months

From
Erik Rijkers
Date:
On 2017-05-27 01:35, Mark Kirkwood wrote:
> On 26/05/17 20:09, Erik Rijkers wrote:
> 
>> 
>> The idea is simple enough:
>> 
>> startup instance1
>> startup instance2 (on same machine)
>> primary: init pgbench tables
>> primary: add primary key to pgbench_history
>> copy empty tables to replica by dump/restore
>> primary: start publication
>> replica: start subscription
>> primary: run 1-minute pgbench
>> wait till the 4 md5's of primary pgbench tables
>>   are the same as the 4 md5's of replica pgbench
>>   tables (this will need a time-out).
>> log 'ok' or 'not ok'
>> primary: clean up publication
>> replica: clean up subscription
>> shutdown primary
>> shutdown replica
>> 
>> this whole thing 100x


Here is what I have:

instances.sh:
   starts up 2 assert enabled sessions

instances_fast.sh:
   alternative to instances.sh
   starts up 2 assert disabled 'fast' sessions

testset.sh
   loop to call pgbench_derail2.sh with varying params

pgbench_derail2.sh
   main test program
   can be called 'standalone'
     ./pgbench_derail2.sh $scale $clients $duration $date_str
   so for instance this should work:
     ./pgbench_derail2.sh 25 64 60 20170527_1019
   to remove publication and subscription from sessions, add a 5th 
parameter 'clean'
     ./pgbench_derail2.sh 1 1 1 1 'clean'

pubsub.sh
   displays replication state. also called by pgbench_derail2.sh
   must be in path

result.sh
   display results
   I keep this in a screen-session as:
   watch -n 20 './result.sh 201705'


Peculiar to my setup also:
   server version at compile time stamped with date + commit hash
   I misuse information_schema.sql_packages  at compile time to store 
patch information
   instances are in  $pg_stuff_dir/pg_installations/pgsql.<project name>

So you'll have to outcomment a line here and there, and adapt paths, 
ports, and things like that.

It's a bit messy, I should have used perl from the beginning...

Good luck :)

Erik Rijkers









Attachment

Re: [HACKERS] logical replication - still unstable after all these months

From
Erik Rijkers
Date:
On 2017-05-27 10:30, Erik Rijkers wrote:
> On 2017-05-27 01:35, Mark Kirkwood wrote:

> Here is what I have:
> 
> instances.sh:
> testset.sh
> pgbench_derail2.sh
> pubsub.sh


To be clear:

( Apart from that standalone call like
    ./pgbench_derail2.sh $scale $clients $duration $date_str
)

I normally run by editing the parameters in testset.sh, then run:

./testset.sh

that then shows a tail -F of the output-logfile (to paste into another 
screen).

in yet another screen the 'watch -n20 results.sh' line

The output files are the .txt files.
The logfiles of the instances are (at the end of each test) copied to 
directory  logfiles/
under a meaningful name that shows the parameters, and with an extension 
like '.ok.log' or '.NOK.log'.

I am very curious at your results.




On 27 May 2017 at 09:44, Erik Rijkers <er@xs4all.nl> wrote:

> I am very curious at your results.

We take your bug report on good faith, but we still haven't seen
details of the problem or how to recreate it.

Please post some details. Thanks.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] logical replication - still unstable after all these months

From
Andres Freund
Date:

On May 27, 2017 6:13:19 AM EDT, Simon Riggs <simon@2ndquadrant.com> wrote:
>On 27 May 2017 at 09:44, Erik Rijkers <er@xs4all.nl> wrote:
>
>> I am very curious at your results.
>
>We take your bug report on good faith, but we still haven't seen
>details of the problem or how to recreate it.
>
>Please post some details. Thanks.

?
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: [HACKERS] logical replication - still unstable after all these months

From
Erik Rijkers
Date:
On 2017-05-27 17:11, Andres Freund wrote:
> On May 27, 2017 6:13:19 AM EDT, Simon Riggs <simon@2ndquadrant.com> 
> wrote:
>> On 27 May 2017 at 09:44, Erik Rijkers <er@xs4all.nl> wrote:
>> 
>>> I am very curious at your results.
>> 
>> We take your bug report on good faith, but we still haven't seen
>> details of the problem or how to recreate it.
>> 
>> Please post some details. Thanks.
> 
> ?

ok, ok...

( The thing is, I am trying to pre-digest the output but it takes time )

I can do this now: attached some output that belongs with this group of 
100  1-minute runs:

-- out_20170525_1426.txt
     100 -- pgbench -c 64 -j 8 -T 60 -P 12 -n   --  scale 25
      82 -- All is well.
      18 -- Not good.

That is the worst set of runs of what I showed earlier.

that is:  out_20170525_1426.txt  and
2x18 logfiles that the 18 failed runs produced.
Those logfiles have names like:
logrep.20170525_1426.1436.1.scale_25.clients_64.NOK.log
logrep.20170525_1426.1436.2.scale_25.clients_64.NOK.log

.1.=primary
.2.=replica

Please disregard the errors around pg_current_wal_location().  (it was 
caused by some code to dump some wal into zipfiles which obviously 
stopped working after the function was removed/renamed)  There are also 
some unimportant errors from the test-harness where I call with the 
wrong port.  Not interesting, I don't think.


Attachment

Re: [HACKERS] logical replication - still unstable after all these months

From
Mark Kirkwood
Date:
Interesting - might be good to see your test script too (so we can 
better understand how you are deciding if the runs are successful or not).


Also, any idea which rows are different? If you want something out of 
the box that will do that for you see DBIx::Compare.

regards

Mark


On 28/05/17 04:12, Erik Rijkers wrote:
>
> ok, ok...
>
> ( The thing is, I am trying to pre-digest the output but it takes time )
>
> I can do this now: attached some output that belongs with this group 
> of 100  1-minute runs:
>
> -- out_20170525_1426.txt
>     100 -- pgbench -c 64 -j 8 -T 60 -P 12 -n   --  scale 25
>      82 -- All is well.
>      18 -- Not good.
>
> That is the worst set of runs of what I showed earlier.
>
> that is:  out_20170525_1426.txt  and
> 2x18 logfiles that the 18 failed runs produced.
> Those logfiles have names like:
> logrep.20170525_1426.1436.1.scale_25.clients_64.NOK.log
> logrep.20170525_1426.1436.2.scale_25.clients_64.NOK.log
>
> .1.=primary
> .2.=replica
>
> Please disregard the errors around pg_current_wal_location().  (it was 
> caused by some code to dump some wal into zipfiles which obviously 
> stopped working after the function was removed/renamed) There are also 
> some uninportant errors from the test-harness where I call with the 
> wrong port.  Not interesting, I don't think.
>
>




Re: [HACKERS] logical replication - still unstable after all these months

From
Mark Kirkwood
Date:
Sorry - I see you have done this already.

On 28/05/17 11:15, Mark Kirkwood wrote:
> Interesting - might be good to see your test script too (so we can 
> better understand how you are deciding if the runs are successful or 
> not).
>




Re: [HACKERS] logical replication - still unstable after all these months

From
Erik Rijkers
Date:
On 2017-05-28 01:21, Mark Kirkwood wrote:
> Sorry - I see you have done this already.
> 
> On 28/05/17 11:15, Mark Kirkwood wrote:
>> Interesting - might be good to see your test script too (so we can 
>> better understand how you are deciding if the runs are successful or 
>> not).


Yes, in pgbench_derail2.sh in the cb function it says:
  if [[ "${md5_total[$port1]}" == "${md5_total[$port2]}" ]]
  then
    echo " ok"
  else
    echo " NOK"
  fi

This is the final decision about success ('ok') or failure ('NOK'). (NOK 
stands for 'Not OK')

The two compared md5's (on the two ports: primary and replica) are each 
taken over a concatenation of the 4 separate md5's of the table-content 
(taken earlier in cb()).  If one or more of the 4 md5's differs, then 
that concatenation-md5 will differ too.

Sorry, there are not many comments in there.







Re: [HACKERS] logical replication - still unstable after all these months

From
Erik Rijkers
Date:
On 2017-05-28 01:15, Mark Kirkwood wrote:

> Also, any idea which rows are different? If you want something out of
> the box that will do that for you see DBIx::Compare.

I used to save the content-diffs too but in the end decided they were 
useless (to me, anyway).



Re: [HACKERS] logical replication - still unstable after all these months

From
Mark Kirkwood
Date:
On 27/05/17 20:30, Erik Rijkers wrote:

>
>
> Here is what I have:
>
> instances.sh:
>   starts up 2 assert enabled sessions
>
> instances_fast.sh:
>   alternative to instances.sh
>   starts up 2 assert disabled 'fast' sessions
>
> testset.sh
>   loop to call pgbench_derail2.sh with varying params
>
> pgbench_derail2.sh
>   main test program
>   can be called 'standalone'
>     ./pgbench_derail2.sh $scale $clients $duration $date_str
>   so for instance this should work:
>     ./pgbench_derail2.sh 25 64 60 20170527_1019
>   to remove publication and subscription from sessions, add a 5th 
> parameter 'clean'
>     ./pgbench_derail2.sh 1 1 1 1 'clean'
>
> pubsub.sh
>   displays replication state. also called by pgbench_derail2.sh
>   must be in path
>
> result.sh
>   display results
>   I keep this in a screen-session as:
>   watch -n 20 './result.sh 201705'
>
>
> Peculiar to my setup also:
>   server version at compile time stamped with date + commit hash
>   I misuse information_schema.sql_packages  at compile time to store 
> patch information
>   instances are in $pg_stuff_dir/pg_installations/pgsql.<project name>
>
> So you'll have to outcomment a line here and there, and adapt paths, 
> ports, and things like that.
>
> It's a bit messy, I should have used perl from the beginning...
>

Considering it is all shell - pretty nice! I spent a bit of time today 
getting this working in a vanilla Ubuntu 16.04 cloud server. I found a 
few things that didn't work (suspect Erik has some default env variables 
set for ports and databases). These were sufficient to stop logical 
replication working for me at all - due to no dbname specified in the 
subscription connection.

Given I had to make some changes anyway, I moved all the config into one 
place (new file config.sh) - made all the programs use /bin/bash as 
interpreter (/bin/sh just does not work for scripts on Ubuntu), added 
ports and databases as reqd and fixed the need to mess too much with 
PATH (see attached diff).

So running in cloud land now...so far no errors - will update.

regards

Mark


Attachment

Re: [HACKERS] logical replication - still unstable after all these months

From
Erik Rijkers
Date:
On 2017-05-26 15:59, Petr Jelinek wrote:
> Hmm, I was under the impression that the changes we proposed in the
> snapbuild thread fixed your issues, does this mean they didn't? Or the
> modified versions of those that were eventually committed didn't? Or 
> did
> issues reappear at some point?

Here is a bit of info:

Just now (using Mark Kirkwood's version of my test) I had a session 
logging this:
  unknown relation state "w"

which I had never seen before.

This is column srsubstate in pg_subscription_rel.

That session completed successfully ('replica ok'), so it's not 
necessarily a problem.


grepping through my earlier logs (of weeks of intermittent test-runs), I 
found only one more (timestamp 20170525_0125).  Here it occurred in a 
failed session.

No idea what it means.  At the very least this value 'w' is missing from 
the documentation, which only mentions:
  i = initialize
  d = data copy
  s = synchronized
  r = (normal replication)
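
For reference, the per-table state can be watched on the subscriber with 
something like this (srrelid and srsubstate are the actual 
pg_subscription_rel columns; the replica here runs on $port2):

-------------------
echo "select srrelid::regclass, srsubstate from pg_subscription_rel
      order by 1;" | psql -qtAXp $port2
-------------------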


Erik Rijkers









Re: [HACKERS] logical replication - still unstable after all these months

From
Mark Kirkwood
Date:
On 28/05/17 19:01, Mark Kirkwood wrote:

>
> So running in cloud land now...so for no errors - will update.
>
>
>

The framework ran 600 tests last night, and I see 3 'NOK' results, i.e 3 
failed test runs (all scale 25 and 8 pgbench clients). Given the way the 
test decides on failure (gets tired of waiting for the table md5's to 
match) - it begs the question 'What if it had waited a bit longer'? 
However from what I can see in all cases:

- the rowcounts were the same in master and replica
- the md5 of pgbench_accounts was different

...so does seem possible that there is some bug being tickled here. 
Unfortunately the test framework blasts away the failed tables and 
subscription and continues on...I'm going to amend it to stop on failure 
so I can have a closer look at what happened.

regards

Mark



On Sun, May 28, 2017 at 3:17 PM, Mark Kirkwood <mark.kirkwood@catalyst.net.nz> wrote:
On 28/05/17 19:01, Mark Kirkwood wrote:


So running in cloud land now...so for no errors - will update.




The framework ran 600 tests last night, and I see 3 'NOK' results, i.e 3 failed test runs (all scale 25 and 8 pgbench clients). Given the way the test decides on failure (gets tired of waiting for the table md5's to match) - it begs the question 'What if it had waited a bit longer'? However from what I can see in all cases:

- the rowcounts were the same in master and replica
- the md5 of pgbench_accounts was different

All four tables should be wrong if there is still a transaction it is waiting for, as all the changes happen in a single transaction.  

I also got a failure, after 87 iterations of a similar test case.  It waited for hours, as mine requires manual intervention to stop waiting.  On the subscriber, one account still had a zero balance, while the history table on the subscriber agreed with both history and accounts on the publisher and the account should not have been zero, so definitely a transaction atomicity got busted.

I altered the script to also save the tellers and branches tables and repeated the runs, but so far it hasn't failed again in over 800 iterations using the altered script.
   

...so does seem possible that there is some bug being tickled here. Unfortunately the test framework blasts away the failed tables and subscription and continues on...I'm going to amend it to stop on failure so I can have a closer look at what happened.

What would you want to look at?  Would saving the WAL from the master be helpful?

Cheers,

Jeff

Re: [HACKERS] logical replication - still unstable after all these months

From
Erik Rijkers
Date:
On 2017-05-29 00:17, Mark Kirkwood wrote:
> On 28/05/17 19:01, Mark Kirkwood wrote:
> 
>> So running in cloud land now...so for no errors - will update.
> 
> The framework ran 600 tests last night, and I see 3 'NOK' results, i.e
> 3 failed test runs (all scale 25 and 8 pgbench clients). Given the way

Could you also give the params for the successful runs?

Can you say anything about hardware?  (My experience is that older, 
slower, 'worse' hardware makes for more fails.)


Many thanks, by the way.  I'm glad that it turns out I'm probably not 
doing something uniquely stupid (although I'm not glad that there seems 
to be a bug, and an elusive one at that)


Erik Rijkers



Re: [HACKERS] logical replication - still unstable after all these months

From
Erik Rijkers
Date:
On 2017-05-29 03:33, Jeff Janes wrote:
> On Sun, May 28, 2017 at 3:17 PM, Mark Kirkwood <
> mark.kirkwood@catalyst.net.nz> wrote:
> 
> I also got a failure, after 87 iterations of a similar test case.  It
[...]
> repeated the runs, but so far it hasn't failed again in over 800 
> iterations

Could you give the params for the successful runs?
(ideally, a  grep | sort | uniq -c  of the pgbench lines that were run)
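
For instance something like (a sketch, assuming the pgbench command lines 
end up in out_*.txt logs the way they do in my runs):

-------------------
# tally how often each pgbench parameter combination was run
grep -h 'pgbench -c' out_*.txt | sort | uniq -c
-------------------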


Can you say anything about hardware?


Thanks for repeating my lengthy tests.


Erik Rijkers



Re: [HACKERS] logical replication - still unstable after all these months

From
Mark Kirkwood
Date:
On 29/05/17 16:26, Erik Rijkers wrote:

> On 2017-05-29 00:17, Mark Kirkwood wrote:
>> On 28/05/17 19:01, Mark Kirkwood wrote:
>>
>>> So running in cloud land now...so for no errors - will update.
>>
>> The framework ran 600 tests last night, and I see 3 'NOK' results, i.e
>> 3 failed test runs (all scale 25 and 8 pgbench clients). Given the way
>
> Could you also give the params for the successful runs?

Scale 25, clients 90 and 64; scale 5, clients 90, 64 and 8 (i.e. the 
defaults in the driver script).
>
> Can you say anything about hardware?  (My experience is that older, 
> slower, 'worse' hardware makes for more fails.)

It's running in an openstack cloud (so is a libvirt guest): 4 cpus, 4 GB 
ram and 2 disks: one for each Postgres instance, both formatted xfs. Hmm 
so maybe I should run a VM on my workstation and crank the IOPS limit 
way down...in the meantime I'll just let it run :-)
>
>
> Many thanks, by the way.  I'm glad that it turns out I'm probably not 
> doing something uniquely stupid (although I'm not glad that there 
> seems to be a bug, and an elusive one at that)
>
>

Yeah, looks like something subtle :-( Hopefully now it's out in the open 
we'll all figure it out together!

regards

Mark



Re: [HACKERS] logical replication - still unstable after all these months

From
Mark Kirkwood
Date:

On 29/05/17 13:33, Jeff Janes wrote:
> On Sun, May 28, 2017 at 3:17 PM, Mark Kirkwood 
> <mark.kirkwood@catalyst.net.nz <mailto:mark.kirkwood@catalyst.net.nz>> 
> wrote:
>
>     On 28/05/17 19:01, Mark Kirkwood wrote:
>
>
>         So running in cloud land now...so for no errors - will update.
>
>
>
>
>     The framework ran 600 tests last night, and I see 3 'NOK' results,
>     i.e 3 failed test runs (all scale 25 and 8 pgbench clients). Given
>     the way the test decides on failure (gets tired of waiting for the
>     table md5's to match) - it begs the question 'What if it had
>     waited a bit longer'? However from what I can see in all cases:
>
>     - the rowcounts were the same in master and replica
>     - the md5 of pgbench_accounts was different
>
>
> All four tables should be wrong if there is still a transaction it is 
> waiting for, as all the changes happen in a single transaction.
>

Yeah, my thought exactly.

> I also got a failure, after 87 iterations of a similar test case.  It 
> waited for hours, as mine requires manual intervention to stop 
> waiting.  On the subscriber, one account still had a zero balance, 
> while the history table on the subscriber agreed with both history and 
> accounts on the publisher and the account should not have been zero, 
> so definitely a transaction atomicity got busted.
>

Interesting, great that we are seeing the same thing.

> I altered the script to also save the tellers and branches tables and 
> repeated the runs, but so far it hasn't failed again in over 800 
> iterations using the altered script.
>
>
>     ...so does seem possible that there is some bug being tickled
>     here. Unfortunately the test framework blasts away the failed
>     tables and subscription and continues on...I'm going to amend it
>     to stop on failure so I can have a closer look at what happened.
>
>
> What would you want to look at?  Would saving the WAL from the master 
> be helpful?
>
>

Good question - I initially wanted to see if anything changed if I 
waited longer (which you have already figured out) and what was actually 
wrong with the accounts record (which you have gotten to as well)! Nice. 
After that, I'd thought of doing another transaction on an accounts 
record that lives in the same page as the 'gammy' one on the master - 
generally poke about and see what is happening :-)

regards

Mark



Re: [HACKERS] logical replication - still unstable after all these months

From
Petr Jelinek
Date:
On 29/05/17 03:33, Jeff Janes wrote:
> On Sun, May 28, 2017 at 3:17 PM, Mark Kirkwood
> <mark.kirkwood@catalyst.net.nz <mailto:mark.kirkwood@catalyst.net.nz>>
> wrote:
> 
>     The framework ran 600 tests last night, and I see 3 'NOK' results,
>     i.e 3 failed test runs (all scale 25 and 8 pgbench clients). Given
>     the way the test decides on failure (gets tired of waiting for the
>     table md5's to match) - it begs the question 'What if it had waited
>     a bit longer'? However from what I can see in all cases:
> 
>     - the rowcounts were the same in master and replica
>     - the md5 of pgbench_accounts was different
> 
> 
> All four tables should be wrong if there is still a transaction it is
> waiting for, as all the changes happen in a single transaction.  

Not necessarily, if the bug is in the sync worker or in the sync to
apply handover code (which is one of the more complicated parts of all
of the logical replication, so it's prime candidate) then it can easily
be just one table.

> I also got a failure, after 87 iterations of a similar test case.  It
> waited for hours, as mine requires manual intervention to stop waiting. 
> On the subscriber, one account still had a zero balance, while the
> history table on the subscriber agreed with both history and accounts on
> the publisher and the account should not have been zero, so definitely a
> transaction atomicity got busted.

I am glad others are able to reproduce this, my machine is still at 0
failures after 800 cycles.

> 
> What would you want to look at?  Would saving the WAL from the master be
> helpful?
> 

Useful info is, logs from provider (mainly the logical decoding logs
that mention LSNs), logs from subscriber (the lines about when sync
workers finished), contents of the pg_subscription_rel (with srrelid
casted to regclass so we know which table is which), and pg_waldump
output around the LSNs found in the logs and in the pg_subscription_rel
(+ few lines before and some after to get context). It's enough to only
care about LSNs for the table(s) that are out of sync.
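
For example, something along these lines (a sketch; the data directory
and the LSN range are placeholders to be filled in from the actual logs
and catalog contents):

-------------------
# per-table sync state and LSN on the subscriber, with readable names
echo "select srrelid::regclass, srsubstate, srsublsn
      from pg_subscription_rel;" | psql -qtAXp $port2

# decode the provider's WAL around the LSNs of interest (placeholder values)
pg_waldump -p $primary_datadir/pg_wal -s 0/1500000 -e 0/1600000
-------------------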

-- 
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] logical replication - still unstable after all these months

From
Mark Kirkwood
Date:
On 29/05/17 23:14, Petr Jelinek wrote:

> On 29/05/17 03:33, Jeff Janes wrote:
>
>> What would you want to look at?  Would saving the WAL from the master be
>> helpful?
>>
> Useful info is, logs from provider (mainly the logical decoding logs
> that mention LSNs), logs from subscriber (the lines about when sync
> workers finished), contents of the pg_subscription_rel (with srrelid
> casted to regclass so we know which table is which), and pg_waldump
> output around the LSNs found in the logs and in the pg_subscription_rel
> (+ few lines before and some after to get context). It's enough to only
> care about LSNs for the table(s) that are out of sync.
>

I have a run that aborted with failure (accounts table md5 mismatch). 
Petr - would you like to have access to the machine? If so, send me your 
public key and I'll set it up.

Cheers

Mark



Re: [HACKERS] logical replication - still unstable after all these months

From
Erik Rijkers
Date:
On 2017-05-26 08:10, Erik Rijkers wrote:

> If you run a pgbench session of 1 minute over a logical replication
> connection and repeat that 100x this is what you get:
> 
> At clients 90, 64, 8, scale 25:
> -- out_20170525_0944.txt
>     100 -- pgbench -c 90 -j 8 -T 60 -P 12 -n   --  scale 25
>       7 -- Not good.
> -- out_20170525_1426.txt
>     100 -- pgbench -c 64 -j 8 -T 60 -P 12 -n   --  scale 25
>      18 -- Not good.
> -- out_20170525_2049.txt
>     100 -- pgbench -c 8 -j 8 -T 60 -P 12 -n   --  scale 25
>      10 -- Not good.
> At clients 90, 64, 8, scale 5:
> -- out_20170526_0126.txt
>     100 -- pgbench -c 90 -j 8 -T 60 -P 12 -n   --  scale 5
>       2 -- Not good.
> -- out_20170526_0352.txt
>     100 -- pgbench -c 64 -j 8 -T 60 -P 12 -n   --  scale 5
>       3 -- Not good.
> -- out_20170526_0621.txt
>     100 -- pgbench -c 8 -j 8 -T 60 -P 12 -n   --  scale 5
>       4 -- Not good.

It seems this problem is a bit less serious than it looked to me (as 
others find lower numbers of failures).

Still, how is its seriousness graded by now?  Is it a show-stopper?  
Should it go onto the Open Items page?

Is anyone still looking into it?


thanks,

Erik Rijkers




> The above installations (master+replica) are with Petr Jelinek's (and
> Michael Paquier's) last patches
>  0001-Fix-signal-handling-in-logical-workers.patch
>  0002-Make-tablesync-worker-exit-when-apply-dies-while-it-.patch
>  0003-Receive-invalidation-messages-correctly-in-tablesync.patch
>  Remove-the-SKIP-REFRESH-syntax-suggar-in-ALTER-SUBSC-v2.patch




Re: [HACKERS] logical replication - still unstable after all these months

From
Petr Jelinek
Date:
On 29/05/17 23:06, Mark Kirkwood wrote:
> On 29/05/17 23:14, Petr Jelinek wrote:
> 
>> On 29/05/17 03:33, Jeff Janes wrote:
>>
>>> What would you want to look at?  Would saving the WAL from the master be
>>> helpful?
>>>
>> Useful info is, logs from provider (mainly the logical decoding logs
>> that mention LSNs), logs from subscriber (the lines about when sync
>> workers finished), contents of the pg_subscription_rel (with srrelid
>> casted to regclass so we know which table is which), and pg_waldump
>> output around the LSNs found in the logs and in the pg_subscription_rel
>> (+ few lines before and some after to get context). It's enough to only
>> care about LSNs for the table(s) that are out of sync.
>>
> 
> I have a run that aborted with failure (accounts table md5 mismatch).
> Petr - would you like to have access to the machine ? If so send me you
> public key and I'll set it up.

Thanks to Mark's offer I was able to study the issue as it happened and
found the cause of this.

The busy loop in apply stops at the point when the worker shmem state
indicates that table synchronization was finished, but that might not be
visible in the next transaction if it takes long to flush the final
commit to disk, so we might ignore a couple of transactions for a given
table in the main apply because we think it's still being synchronized.
This also explains why I could not reproduce it on my testing machine
(fast ssd disk array where flushes were always fast) and why it happens
relatively rarely: it's one specific commit during the whole
synchronization process that needs to be slow.

So as a solution I changed the busy loop in the apply to wait for
in-catalog status rather than in-memory status, to make sure things are
really there and flushed.

While working on this I realized that the handover itself is a bit more
complex than necessary (especially for debugging and for other people
understanding it) so I did some small changes as part of this patch to
make the sequence of states a table goes through during the
synchronization process always the same. This might cause an unnecessary
update per table synchronization process in some cases, but that seems a
small enough price to pay for clearer logic. And it also fixes another
potential bug that I identified where we might write the wrong state to
the catalog if the main apply crashed while the sync worker was waiting
for a status update.

I've been running tests on this overnight on another machine where I was
able to reproduce the original issue within a few runs (once I found what
causes it) and so far it looks good.

-- 
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


Attachment

Re: [HACKERS] logical replication - still unstable after all these months

From
Erik Rijkers
Date:
On 2017-05-31 11:16, Petr Jelinek wrote:
[...]
> Thanks to Mark's offer I was able to study the issue as it happened and
> found the cause of this.
> 
> [0001-Improve-handover-logic-between-sync-and-apply-worker.patch]

This looks good:

-- out_20170531_1141.txt
    100 -- pgbench -c 90 -j 8 -T 60 -P 12 -n   --  scale 25
    100 -- All is well.

So this is 100x a 1-minute test with 100x success. (This on the most 
fastidious machine (slow disks, meagre specs) that used to give 15% 
failures)

I'll let it run for a couple of days with varying params (and on varying 
hardware) but it definitely does look as if you fixed it.

Thanks!

Erik Rijkers



Re: [HACKERS] logical replication - still unstable after all these months

From
Peter Eisentraut
Date:
On 5/31/17 05:16, Petr Jelinek wrote:
> I've been running tests on this overnight on another machine where I was
> able to reproduce  the original issue within few runs (once I found what
> causes it) and so far looks good.

I'll give people another day or so to test this before committing.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] logical replication - still unstable after all these months

From
Mark Kirkwood
Date:
On 31/05/17 21:16, Petr Jelinek wrote:

> On 29/05/17 23:06, Mark Kirkwood wrote:
>> On 29/05/17 23:14, Petr Jelinek wrote:
>>
>>> On 29/05/17 03:33, Jeff Janes wrote:
>>>
>>>> What would you want to look at?  Would saving the WAL from the master be
>>>> helpful?
>>>>
>>> Useful info is, logs from provider (mainly the logical decoding logs
>>> that mention LSNs), logs from subscriber (the lines about when sync
>>> workers finished), contents of the pg_subscription_rel (with srrelid
>>> casted to regclass so we know which table is which), and pg_waldump
>>> output around the LSNs found in the logs and in the pg_subscription_rel
>>> (+ few lines before and some after to get context). It's enough to only
>>> care about LSNs for the table(s) that are out of sync.
>>>
>> I have a run that aborted with failure (accounts table md5 mismatch).
>> Petr - would you like to have access to the machine ? If so send me you
>> public key and I'll set it up.
> Thanks to Mark's offer I was able to study the issue as it happened and
> found the cause of this.
>
> The busy loop in apply stops at the point when worker shmem state
> indicates that table synchronization was finished, but that might not be
> visible in the next transaction if it takes long to flush the final
> commit to disk so we might ignore couple of transactions for given table
> in the main apply because we think it's still being synchronized. This
> also explains why I could not reproduce it on my testing machine (fast
> ssd disk array where flushes were always fast) and why it happens
> relatively rarely because it's one specific commit during the whole
> synchronization process that needs to be slow.
>
> So as solution I changed the busy loop in the apply to wait for
> in-catalog status rather than in-memory status to make sure things are
> really there and flushed.
>
> While working on this I realized that the handover itself is bit more
> complex than necessary (especially for debugging and for other people
> understanding it) so I did some small changes as part of this patch to
> make the sequences of states table goes during synchronization process
> to always be the same. This might cause unnecessary update per one table
> synchronization process in some cases but that seems like small enough
> price to pay for clearer logic. And it also fixes another potential bug
> that I identified where we might write wrong state to catalog if main
> apply crashed while sync worker was waiting for status update.
>
> I've been running tests on this overnight on another machine where I was
> able to reproduce  the original issue within few runs (once I found what
> causes it) and so far looks good.
>
>
>

I'm seeing a new failure with the patch applied - this time the history 
table has missing rows. Petr, I'll put back your access :-)

regards

Mark




Re: [HACKERS] logical replication - still unstable after all these months

From
Erik Rijkers
Date:
On 2017-06-02 00:46, Mark Kirkwood wrote:
> On 31/05/17 21:16, Petr Jelinek wrote:
> 
> I'm seeing a new failure with the patch applied - this time the
> history table has missing rows. Petr, I'll put back your access :-)

Is this error during 1-minute runs?

I'm asking because I've moved back to longer (1-hour) runs (no errors so 
far), and I'd like to keep track of what the most 'vulnerable' 
parameters are.

thanks,

Erik Rijkers



Re: [HACKERS] logical replication - still unstable after all these months

From
Mark Kirkwood
Date:
On 02/06/17 17:11, Erik Rijkers wrote:

> On 2017-06-02 00:46, Mark Kirkwood wrote:
>> On 31/05/17 21:16, Petr Jelinek wrote:
>>
>> I'm seeing a new failure with the patch applied - this time the
>> history table has missing rows. Petr, I'll put back your access :-)
>
> Is this error during 1-minute runs?
>
> I'm asking because I've moved back to longer (1-hour) runs (no errors 
> so far), and I'd like to keep track of what the most 'vulnerable' 
> parameters are.
>

Yeah, still using your test config (with my minor modifications).

When I got the error the 1st time, I did a complete make clean and 
rebuild....but it is still possible I've 'done it wrong' - so 
independent confirmation would be good!

Cheers

Mark



Re: [HACKERS] logical replication - still unstable after all these months

From
Petr Jelinek
Date:
On 02/06/17 08:55, Mark Kirkwood wrote:
> On 02/06/17 17:11, Erik Rijkers wrote:
> 
>> On 2017-06-02 00:46, Mark Kirkwood wrote:
>>> On 31/05/17 21:16, Petr Jelinek wrote:
>>>
>>> I'm seeing a new failure with the patch applied - this time the
>>> history table has missing rows. Petr, I'll put back your access :-)
>>
>> Is this error during 1-minute runs?
>>
>> I'm asking because I've moved back to longer (1-hour) runs (no errors
>> so far), and I'd like to keep track of what the most 'vulnerable'
>> parameters are.
>>
> 
> Yeah, still using your test config (with my minor modifications).
> 
> When I got the error the 1st time, I did a complete make clean and
> rebuild....but it is still possible I've 'done it wrong' - so
> independent confirmation would be good!

Well, I've seen this issue as well while I was developing the fix, but
the patch I proposed fixed it for me as well as the original issue.

-- 
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] logical replication - still unstable after all these months

From
Petr Jelinek
Date:
On 02/06/17 22:29, Petr Jelinek wrote:
> On 02/06/17 08:55, Mark Kirkwood wrote:
>> On 02/06/17 17:11, Erik Rijkers wrote:
>>
>>> On 2017-06-02 00:46, Mark Kirkwood wrote:
>>>> On 31/05/17 21:16, Petr Jelinek wrote:
>>>>
>>>> I'm seeing a new failure with the patch applied - this time the
>>>> history table has missing rows. Petr, I'll put back your access :-)
>>>
>>> Is this error during 1-minute runs?
>>>
>>> I'm asking because I've moved back to longer (1-hour) runs (no errors
>>> so far), and I'd like to keep track of what the most 'vulnerable'
>>> parameters are.
>>>
>>
>> Yeah, still using your test config (with my minor modifications).
>>
>> When I got the error the 1st time, I did a complete make clean and
>> rebuild....but it is still possible I've 'done it wrong' - so
>> independent confirmation would be good!
> 
> Well, I've seen this issue as well while I was developing the fix, but
> the patch I proposed fixed it for me as well as the original issue.
> 

While I was testing something for a different thread I noticed that I
was managing transactions incorrectly in this patch. Fixed here; I didn't
test it much yet (it takes a while, as you know :) ). Not sure if it's
related to the issue you've seen, though.

-- 
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


Attachment

Re: [HACKERS] logical replication - still unstable after all these months

From
Mark Kirkwood
Date:
On 03/06/17 11:10, Petr Jelinek wrote:

> On 02/06/17 22:29, Petr Jelinek wrote:
>> On 02/06/17 08:55, Mark Kirkwood wrote:
>>> On 02/06/17 17:11, Erik Rijkers wrote:
>>>
>>>> On 2017-06-02 00:46, Mark Kirkwood wrote:
>>>>> On 31/05/17 21:16, Petr Jelinek wrote:
>>>>>
>>>>> I'm seeing a new failure with the patch applied - this time the
>>>>> history table has missing rows. Petr, I'll put back your access :-)
>>>> Is this error during 1-minute runs?
>>>>
>>>> I'm asking because I've moved back to longer (1-hour) runs (no errors
>>>> so far), and I'd like to keep track of what the most 'vulnerable'
>>>> parameters are.
>>>>
>>> Yeah, still using your test config (with my minor modifications).
>>>
>>> When I got the error the 1st time, I did a complete make clean and
>>> rebuild....but it is still possible I've 'done it wrong' - so
>>> independent confirmation would be good!
>> Well, I've seen this issue as well while I was developing the fix, but
>> the patch I proposed fixed it for me as well as the original issue.
>>
> While I was testing something for different thread I noticed that I
> manage transactions incorrectly in this patch. Fixed here, I didn't test
> it much yet (it takes a while as you know :) ). Not sure if it's related
> to the issue you've seen though.
>
>
Ok - I've applied this version and am running tests again. I needed to do 
a git pull to apply the patch, so I'm getting some other changes too!

Cheers

Mark




Re: [HACKERS] logical replication - still unstable after all these months

From
Petr Jelinek
Date:
On 03/06/17 04:45, Mark Kirkwood wrote:
> On 03/06/17 11:10, Petr Jelinek wrote:
> 
>> On 02/06/17 22:29, Petr Jelinek wrote:
>>> On 02/06/17 08:55, Mark Kirkwood wrote:
>>>> On 02/06/17 17:11, Erik Rijkers wrote:
>>>>
>>>>> On 2017-06-02 00:46, Mark Kirkwood wrote:
>>>>>> On 31/05/17 21:16, Petr Jelinek wrote:
>>>>>>
>>>>>> I'm seeing a new failure with the patch applied - this time the
>>>>>> history table has missing rows. Petr, I'll put back your access :-)
>>>>> Is this error during 1-minute runs?
>>>>>
>>>>> I'm asking because I've moved back to longer (1-hour) runs (no errors
>>>>> so far), and I'd like to keep track of what the most 'vulnerable'
>>>>> parameters are.
>>>>>
>>>> Yeah, still using your test config (with my minor modifications).
>>>>
>>>> When I got the error the 1st time, I did a complete make clean and
>>>> rebuild....but it is still possible I've 'done it wrong' - so
>>>> independent confirmation would be good!
>>> Well, I've seen this issue as well while I was developing the fix, but
>>> the patch I proposed fixed it for me as well as the original issue.
>>>
>> While I was testing something for different thread I noticed that I
>> manage transactions incorrectly in this patch. Fixed here, I didn't test
>> it much yet (it takes a while as you know :) ). Not sure if it's related
>> to the issue you've seen though.
>>
>>
> Ok - I've applied this version, and running tests again. I needed to do
> a git pull to apply the patch, so getting some other changes too!
> 

Thanks, yes, I forgot to mention that I rebased it against the current
HEAD as well.

-- 
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] logical replication - still unstable after all these months

From
Jeff Janes
Date:
On Fri, Jun 2, 2017 at 4:10 PM, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote:

While I was testing something for different thread I noticed that I
manage transactions incorrectly in this patch. Fixed here, I didn't test
it much yet (it takes a while as you know :) ). Not sure if it's related
to the issue you've seen though.

This conflicts again with Peter E's recent commit 3c9bc2157a4f465b3c070d1250597568d2dc285f

Thanks,

Jeff

Re: [HACKERS] logical replication - still unstable after all these months

From
Erik Rijkers
Date:
On 2017-05-31 16:20, Erik Rijkers wrote:
> On 2017-05-31 11:16, Petr Jelinek wrote:
> [...]
>> Thanks to Mark's offer I was able to study the issue as it happened 
>> and
>> found the cause of this.
>> 
>> [0001-Improve-handover-logic-between-sync-and-apply-worker.patch]
> 
> This looks good:
> 
> -- out_20170531_1141.txt
>     100 -- pgbench -c 90 -j 8 -T 60 -P 12 -n   --  scale 25
>     100 -- All is well.
> 
> So this is 100x a 1-minute test with 100x success. (This on the most
> fastidious machine (slow disks, meagre specs) that used to give 15%
> failures)

[Improve-handover-logic-between-sync-and-apply-worker-v2.patch]

No errors after (several days of) running variants of this. (2500x 1 
minute runs; 12x 1-hour runs)

> 
> Thanks!
> 
> Erik Rijkers



Re: [HACKERS] logical replication - still unstable after all these months

From
Mark Kirkwood
Date:
On 05/06/17 00:04, Erik Rijkers wrote:

> On 2017-05-31 16:20, Erik Rijkers wrote:
>> On 2017-05-31 11:16, Petr Jelinek wrote:
>> [...]
>>> Thanks to Mark's offer I was able to study the issue as it happened and
>>> found the cause of this.
>>>
>>> [0001-Improve-handover-logic-between-sync-and-apply-worker.patch]
>>
>> This looks good:
>>
>> -- out_20170531_1141.txt
>>     100 -- pgbench -c 90 -j 8 -T 60 -P 12 -n   --  scale 25
>>     100 -- All is well.
>>
>> So this is 100x a 1-minute test with 100x success. (This on the most
>> fastidious machine (slow disks, meagre specs) that used to give 15%
>> failures)
>
> [Improve-handover-logic-between-sync-and-apply-worker-v2.patch]
>
> No errors after (several days of) running variants of this. (2500x 1 
> minute runs; 12x 1-hour runs)

Same here, no errors with the v2 patch applied (approx 2 days - all 1 
minute runs)

regards

Mark



Re: [HACKERS] logical replication - still unstable after all these months

From
Mark Kirkwood
Date:

On 05/06/17 13:08, Mark Kirkwood wrote:
> On 05/06/17 00:04, Erik Rijkers wrote:
>
>> On 2017-05-31 16:20, Erik Rijkers wrote:
>>> On 2017-05-31 11:16, Petr Jelinek wrote:
>>> [...]
>>>> Thanks to Mark's offer I was able to study the issue as it happened 
>>>> and
>>>> found the cause of this.
>>>>
>>>> [0001-Improve-handover-logic-between-sync-and-apply-worker.patch]
>>>
>>> This looks good:
>>>
>>> -- out_20170531_1141.txt
>>>     100 -- pgbench -c 90 -j 8 -T 60 -P 12 -n   --  scale 25
>>>     100 -- All is well.
>>>
>>> So this is 100x a 1-minute test with 100x success. (This on the most
>>> fastidious machine (slow disks, meagre specs) that used to give 15%
>>> failures)
>>
>> [Improve-handover-logic-between-sync-and-apply-worker-v2.patch]
>>
>> No errors after (several days of) running variants of this. (2500x 1 
>> minute runs; 12x 1-hour runs)
>
> Same here, no errors with the v2 patch applied (approx 2 days - all 1 
> minute runs)
>

Further, reapplying the v1 patch (with a bit of editing, as I wanted to 
apply it to my current master) quickly gets a failure with missing rows 
in the history table. I'll put back the v2 patch and resume runs with 
that, but I'm cautiously optimistic that the v2 patch solves the issue.

regards

Mark




Re: [HACKERS] logical replication - still unstable after all these months

From
Petr Jelinek
Date:
On 03/06/17 16:12, Jeff Janes wrote:
> 
> On Fri, Jun 2, 2017 at 4:10 PM, Petr Jelinek
> <petr.jelinek@2ndquadrant.com <mailto:petr.jelinek@2ndquadrant.com>> wrote:
> 
> 
>     While I was testing something for different thread I noticed that I
>     manage transactions incorrectly in this patch. Fixed here, I didn't test
>     it much yet (it takes a while as you know :) ). Not sure if it's related
>     to the issue you've seen though.
> 
> 
> This conflicts again with Peter E's recent commit
> 3c9bc2157a4f465b3c070d1250597568d2dc285f
> 

Rebased on top of current HEAD.

-- 
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


Attachment

Re: [HACKERS] logical replication - still unstable after all these months

From
Peter Eisentraut
Date:
On 6/4/17 22:38, Petr Jelinek wrote:
> On 03/06/17 16:12, Jeff Janes wrote:
>>
>> On Fri, Jun 2, 2017 at 4:10 PM, Petr Jelinek
>> <petr.jelinek@2ndquadrant.com <mailto:petr.jelinek@2ndquadrant.com>> wrote:
>>
>>
>>     While I was testing something for different thread I noticed that I
>>     manage transactions incorrectly in this patch. Fixed here, I didn't test
>>     it much yet (it takes a while as you know :) ). Not sure if it's related
>>     to the issue you've seen though.
>>
>>
>> This conflicts again with Peter E's recent commit
>> 3c9bc2157a4f465b3c070d1250597568d2dc285f
>>
> 
> Rebased on top of current HEAD.

Committed that, with some further updates of comments to reflect the
changes.

I do like the simplification of the state progression.

Perhaps it could be simplified even further by eliminating the SYNCDONE
setting in LogicalRepSyncTableStart() and making it go through the apply
loop and end up in process_syncing_tables_for_sync() in all cases, which
is kind of what the comments at the top of the file suggest would happen
anyway.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] logical replication - still unstable after all these months

From
Erik Rijkers
Date:
On 2017-06-06 20:53, Peter Eisentraut wrote:
> On 6/4/17 22:38, Petr Jelinek wrote:
> 
> Committed that, with some further updates of comments to reflect the

Belated apologies all round for the somewhat provocative $subject; but I 
felt at that moment that this item needed some extra attention.

I don't know if it worked but I'm glad that it is solved ;)

Thanks,

Erik Rijkers






Re: [HACKERS] logical replication - still unstable after all these months

From
Robert Haas
Date:
On Tue, Jun 6, 2017 at 3:01 PM, Erik Rijkers <er@xs4all.nl> wrote:
> Belated apologies all round for the somewhat provocative $subject; but I
> felt at that moment that this item needed some extra attention.

FWIW, it seemed like a pretty fair subject line to me given your test
results.  I think it's in everyone's interest to get this feature
stabilized before we ship a final release - people will rely on it,
and if it drops even one row even occasionally, it's going to be a big
problem for our users.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] logical replication - still unstable after all these months

From
Petr Jelinek
Date:
On 06/06/17 21:09, Robert Haas wrote:
> On Tue, Jun 6, 2017 at 3:01 PM, Erik Rijkers <er@xs4all.nl> wrote:
>> Belated apologies all round for the somewhat provocative $subject; but I
>> felt at that moment that this item needed some extra attention.
> 
> FWIW, it seemed like a pretty fair subject line to me given your test
> results.  I think it's in everyone's interest to get this feature
> stabilized before we ship a final release - people will rely on it,
> and if it drops even one row even occasionally, it's going to be a big
> problem for our users.
> 

+1, I considered the issues solved when the snapshot builder issues were
fixed, so this was a good reminder to check again.

-- 
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] logical replication - still unstable after all these months

From
Erik Rijkers
Date:
On 2017-05-28 12:44, Erik Rijkers wrote:

re: srsubstate in pg_subscription_rel:

> No idea what it means.  At the very least this value 'w' is missing
> from the documentation, which only mentions:
>   i = initialize
>   d = data copy
>   s = synchronized
>   r = (normal replication)

Shouldn't we add this to that table (51.53) in the documentation?

After all, the value 'w' does show up when you monitor 
pg_subscription_rel.
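
Monitoring it during a run is just a matter of repeating something like 
the following in psql on the subscriber (the documented values being i, 
d, s and r):

  select srrelid::regclass as tab, srsubstate
    from pg_subscription_rel
   order by 1;
  \watch 1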



Re: [HACKERS] logical replication - still unstable after all these months

From
Peter Eisentraut
Date:
On 6/17/17 06:48, Erik Rijkers wrote:
> On 2017-05-28 12:44, Erik Rijkers wrote:
> 
> re: srsubstate in pg_subscription_rel:
> 
>> No idea what it means.  At the very least this value 'w' is missing
>> from the documentation, which only mentions:
>>   i = initialize
>>   d = data copy
>>   s = synchronized
>>   r = (normal replication)
> 
> Shouldn't we add this to that table (51.53) in the documentation?
> 
> After all, the value 'w' does show up when you monitor 
> pg_subscription_rel.

It's not supposed to.  Have you seen it after
e3a815d2faa5be28551e71d5db44fb2c78133433?

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] logical replication - still unstable after all these months

From
Erik Rijkers
Date:
On 2017-06-18 00:27, Peter Eisentraut wrote:
> On 6/17/17 06:48, Erik Rijkers wrote:
>> On 2017-05-28 12:44, Erik Rijkers wrote:
>> 
>> re: srsubstate in pg_subscription_rel:
>> 
>>> No idea what it means.  At the very least this value 'w' is missing
>>> from the documentation, which only mentions:
>>>   i = initialize
>>>   d = data copy
>>>   s = synchronized
>>>   r = (normal replication)
>> 
>> Shouldn't we add this to that table (51.53) in the documentation?
>> 
>> After all, the value 'w' does show up when you monitor
>> pg_subscription_rel.
> 
> It's not supposed to.  Have you seen it after
> e3a815d2faa5be28551e71d5db44fb2c78133433?

Ah no, I haven't seen that 'w' value after that (and thousands of tests 
have run without error since then).

I just hadn't realized that the 'w' value I had reported was indeed an 
erroneous state.

thanks, this is OK then.

Erik Rijkers