Thread: [HACKERS] logical replication - still unstable after all these months
If you run a pgbench session of 1 minute over a logical replication connection and repeat that 100x, this is what you get:

At clients 90, 64, 8, scale 25:

-- out_20170525_0944.txt
    100 -- pgbench -c 90 -j 8 -T 60 -P 12 -n   --  scale 25
     93 -- All is well.
      7 -- Not good.
-- out_20170525_1426.txt
    100 -- pgbench -c 64 -j 8 -T 60 -P 12 -n   --  scale 25
     82 -- All is well.
     18 -- Not good.
-- out_20170525_2049.txt
    100 -- pgbench -c 8 -j 8 -T 60 -P 12 -n   --  scale 25
     90 -- All is well.
     10 -- Not good.

At clients 90, 64, 8, scale 5:

-- out_20170526_0126.txt
    100 -- pgbench -c 90 -j 8 -T 60 -P 12 -n   --  scale 5
     98 -- All is well.
      2 -- Not good.
-- out_20170526_0352.txt
    100 -- pgbench -c 64 -j 8 -T 60 -P 12 -n   --  scale 5
     97 -- All is well.
      3 -- Not good.
-- out_20170526_0621.txt
     45 -- pgbench -c 8 -j 8 -T 60 -P 12 -n   --  scale 5
     41 -- All is well.
      3 -- Not good.
(That last one obviously not finished)

I think this is pretty awful, really, for a beta level.

The above installations (master+replica) are with Petr Jelinek's (and Michael Paquier's) last patches:
  0001-Fix-signal-handling-in-logical-workers.patch
  0002-Make-tablesync-worker-exit-when-apply-dies-while-it-.patch
  0003-Receive-invalidation-messages-correctly-in-tablesync.patch
  Remove-the-SKIP-REFRESH-syntax-suggar-in-ALTER-SUBSC-v2.patch

Now, it could be that there is somehow something wrong with my test setup (as opposed to some bug in logical replication). I can post my test program, but I'll do that separately (the core of all my tests is below -- it's basically still that very first test that I started out with, many months ago...)

I'd like to find out/know more about:
- Do you agree this number of failures is far too high?
- Am I the only one finding so many failures?
- Is anyone else testing the same way (more or less continually, finding only success)?
- Which of the Open Items could be responsible for this failure rate? (I don't see a match.)
- What tests do others do? Could we somehow concentrate results and method somewhere?

Thanks,

Erik Rijkers

PS
The core of the 'pgbench_derail' test (bash) is simply:

echo "drop table if exists pgbench_accounts;
      drop table if exists pgbench_branches;
      drop table if exists pgbench_tellers;
      drop table if exists pgbench_history;" | psql -qXp $port1 \
  && echo "drop table if exists pgbench_accounts;
      drop table if exists pgbench_branches;
      drop table if exists pgbench_tellers;
      drop table if exists pgbench_history;" | psql -qXp $port2 \
  && pgbench -p $port1 -qis $scale \
  && echo "alter table pgbench_history add column hid serial primary key;" \
       | psql -q1Xp $port1 \
  && pg_dump -F c -p $port1 \
       --exclude-table-data=pgbench_history  \
       --exclude-table-data=pgbench_accounts \
       --exclude-table-data=pgbench_branches \
       --exclude-table-data=pgbench_tellers  \
       -t pgbench_history -t pgbench_accounts \
       -t pgbench_branches -t pgbench_tellers \
     | pg_restore -1 -p $port2 -d testdb

appname=derail2

echo "create publication pub1 for all tables;" | psql -p $port1 -aqtAX

echo "create subscription sub1
        connection 'port=${port1} application_name=$appname'
        publication pub1 with (enabled=false);
      alter subscription sub1 enable;" | psql -p $port2 -aqtAX

pgbench -c $clients -j $threads -T $duration -P $pseconds -n   #  scale $scale

Now compare md5's of the sorted content of each of the 4 pgbench tables on primary and replica. They should be the same.
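For illustration, that final comparison step can be done with something as small as the sketch below. This is not the author's harness (that follows later in the thread); the ports and the database are assumed, adjust to your own setup:

-------------------
# assumed: $port1 = primary, $port2 = replica, same database name on both
for t in pgbench_accounts pgbench_branches pgbench_tellers pgbench_history
do
  m1=$(echo "select * from $t order by 1" | psql -qtAXp $port1 | md5sum | cut -d' ' -f1)
  m2=$(echo "select * from $t order by 1" | psql -qtAXp $port2 | md5sum | cut -d' ' -f1)
  if [ "$m1" = "$m2" ]
  then echo "$t: same"
  else echo "$t: DIFFERENT"
  fi
done
-------------------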
On 26 May 2017 at 07:10, Erik Rijkers <er@xs4all.nl> wrote:
> - Do you agree this number of failures is far too high?
> - Am I the only one finding so many failures?

What type of failure are you getting?

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2017-05-26 08:58, Simon Riggs wrote:
> On 26 May 2017 at 07:10, Erik Rijkers <er@xs4all.nl> wrote:
>
>> - Do you agree this number of failures is far too high?
>> - Am I the only one finding so many failures?
>
> What type of failure are you getting?

The failure is that in the result state the replicated tables differ from the original tables.

For instance:

-- out_20170525_0944.txt
    100 -- pgbench -c 90 -j 8 -T 60 -P 12 -n   --  scale 25
     93 -- All is well.
      7 -- Not good.

These numbers mean: the result state of primary and replica is not the same, in 7 out of 100 runs.

'Not the same state' means: at least one of the 4 md5's of the sorted content of the 4 pgbench tables on the primary is different from those taken from the replica.

So, 'failure' means: the 4 pgbench tables on primary and replica are not exactly the same after the (one-minute) pgbench run has finished, and logical replication has 'finished'. (Plenty of time is given for the replica to catch up. The test only calls 'failure' after 20x waiting (for 15 seconds) and 20x finding the same erroneous state (erroneous because not the same as on the primary).)

I would really like to know if you think that that doesn't amount to 'failure'.
On 26 May 2017 at 08:27, Erik Rijkers <er@xs4all.nl> wrote:
> On 2017-05-26 08:58, Simon Riggs wrote:
>>
>> On 26 May 2017 at 07:10, Erik Rijkers <er@xs4all.nl> wrote:
>>
>>> - Do you agree this number of failures is far too high?
>>> - Am I the only one finding so many failures?
>>
>> What type of failure are you getting?
>
> The failure is that in the result state the replicated tables differ from
> the original tables.

An important point to note is that this is time dependent.

> I would really like to know if you think that that doesn't amount to
> 'failure'.

Yes, your test has failed. Even one wrong record in one test is a serious problem and needs to be fixed.

If we can find out what the bug is with a repeatable test case we can fix it.

Could you provide more details? Thanks

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2017-05-26 09:40, Simon Riggs wrote:
>
> If we can find out what the bug is with a repeatable test case we can
> fix it.
>
> Could you provide more details? Thanks

I will, just need some time to clean things up a bit.

But what I would like is for someone else to repeat my 100x 1-minute tests, taking as core that snippet I posted in my previous email. I built bash-stuff around that core (to take md5's, shut-down/start-up the two instances between runs, write info to log-files, etc). But it would be good if someone else made that separately because if that then does not fail, it would prove that my test-harness is at fault (and not logical replication).

The idea is simple enough:

  startup instance1
  startup instance2 (on same machine)
  primary:  init pgbench tables
  primary:  add primary key to pgbench_history
  copy empty tables to replica by dump/restore
  primary:  start publication
  replica:  start subscription
  primary:  run 1-minute pgbench
  wait till the 4 md5's of primary pgbench tables
     are the same as the 4 md5's of replica pgbench
     tables (this will need a time-out; see the sketch after this list)
  log 'ok' or 'not ok'
  primary:  clean up publication
  replica:  clean up subscription
  shutdown primary
  shutdown replica

  this whole thing 100x
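A minimal sketch of that wait-with-time-out step, matching the "20x waiting (for 15 seconds)" rule described earlier in the thread. compare_md5s is a hypothetical helper (returns success when all 4 table md5's match); it is not part of the posted scripts:

-------------------
tries=20
ok=no
while [ $tries -gt 0 ]
do
  if compare_md5s            # hypothetical: exits 0 when primary and replica match
  then
     ok=yes
     break
  fi
  sleep 15
  tries=$((tries - 1))
done
if [ $ok = yes ]
then echo "-- All is well."
else echo "-- Not good."
fi
-------------------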
On 26/05/17 20:09, Erik Rijkers wrote:

> On 2017-05-26 09:40, Simon Riggs wrote:
>>
>> If we can find out what the bug is with a repeatable test case we can
>> fix it.
>>
>> Could you provide more details? Thanks
>
> I will, just need some time to clean things up a bit.
>
> But what I would like is for someone else to repeat my 100x 1-minute
> tests, taking as core that snippet I posted in my previous email. I
> built bash-stuff around that core (to take md5's, shut-down/start-up
> the two instances between runs, write info to log-files, etc). But it
> would be good if someone else made that separately because if that
> then does not fail, it would prove that my test-harness is at fault
> (and not logical replication).

Will do - what I had been doing was running pgbench, waiting until the row counts on the replica pgbench_history were the same as on the primary, then summing the %balance and delta fields from the primary and replica dbs and comparing. So far - all match up ok. However I'd been running longer time frames (5 minutes), so not the same number of repetitions as yet.

regards

Mark
On 2017-05-26 10:29, Mark Kirkwood wrote:
> On 26/05/17 20:09, Erik Rijkers wrote:
>
>> On 2017-05-26 09:40, Simon Riggs wrote:
>>>
>>> If we can find out what the bug is with a repeatable test case we can
>>> fix it.
>>>
>>> Could you provide more details? Thanks
>>
>> I will, just need some time to clean things up a bit.
>>
>> But what I would like is for someone else to repeat my 100x 1-minute
>> tests, taking as core that snippet I posted in my previous email. I
>> built bash-stuff around that core (to take md5's, shut-down/start-up
>> the two instances between runs, write info to log-files, etc). But it
>> would be good if someone else made that separately because if that
>> then does not fail, it would prove that my test-harness is at fault
>> (and not logical replication).
>
> Will do - what I had been doing was running pgbench, waiting until the

Great!

You'll have to think about whether to go with instances of either master, or master+those 4 patches. I guess either choice makes sense.

> row counts on the replica pgbench_history were the same as the
> primary, then summing the %balance and delta fields from the primary
> and replica dbs and comparing. So far - all match up ok. However I'd

I did number-summing for a while as well (because it's a lot faster than taking md5's over the full content). But the problem with summing is that (I think) in the end you cannot be really sure that the result is correct (false positives, although I don't understand the odds).

> been running a longer time frames (5 minutes), so not the same number
> of repetitions as yet.

I've run 3600-, 30- and 15-minute runs too, but in this case (these 100x tests) I wanted to especially test the area around startup/initialise of logical replication. Also the increasing quality of logical replication (once it runs with the correct

thanks,

Erik Rijkers
On 05/26/2017 12:57 PM, Erik Rijkers wrote:
> The failure is that in the result state the replicated tables differ
> from the original tables.

I am also getting similar behavior.

Master =
  run pgbench with scaling factor = 1   (./pgbench -i -s 1 postgres)
  delete rows from pgbench_history      (delete from pgbench_history)
  create publication                    (create publication pub for table pgbench_history)

Slave =
  run pgbench with scaling factor = 1   (./pgbench -i -s 1 postgres -p 5000)
  delete rows from pgbench_history      (delete from pgbench_history)
  create subscription                   (create subscription sub connection 'dbname=postgres host=localhost user=centos' publication pub;)

create a test.sql file, having an insert statement:

[centos@centos-cpula bin]$ cat test.sql
insert into pgbench_history values (1,1,1,1,now(),'anv');

now run pgbench with -T / -c / -j options

First time =
./pgbench -t 5 -c 90 -j 90 -f test.sql postgres
count on Master/slave are SAME.

run second time =
./pgbench -T 20 -c 90 -j 90 -f test.sql postgres
check the row count on master/standby

Master =
postgres=# select count(*) from pgbench_history ;
 count
--------
 536836
(1 row)

Standby =
postgres=# select count(*) from pgbench_history ;
  count
---------
 1090959
(1 row)

--
regards, tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company
Hi,

Hmm, I was under the impression that the changes we proposed in the snapbuild thread fixed your issues, does this mean they didn't? Or the modified versions of those that were eventually committed didn't? Or did issues reappear at some point?

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2017-05-26 15:59, Petr Jelinek wrote:
> Hi,
>
> Hmm, I was under the impression that the changes we proposed in the
> snapbuild thread fixed your issues, does this mean they didn't? Or the
> modified versions of those that were eventually committed didn't? Or did
> issues reappear at some point?

I do think the snapbuild fixes solved certain problems. I can't say where the present problems are caused (as I have said, I suspect logical replication, but also my own test-harness: perhaps it leaves some error-state lying around, although I do try hard to prevent that) -- so I just don't know.

I wouldn't say that problems (re)appeared at a certain point; my impression is rather that logical replication has become better and better. But I kept getting the odd failure, without a clear cause, but always (eventually) repeatable on other machines. I did the 1-minute pgbench-derail version exactly because of the earlier problems with snapbuild: I wanted a test that does a lot of starting and stopping of publication and subscription.

Erik Rijkers
Erik Rijkers wrote:

> I wouldn't say that problems (re)appeared at a certain point; my impression
> is rather that logical replication has become better and better. But I kept
> getting the odd failure, without a clear cause, but always (eventually)
> repeatable on other machines. I did the 1-minute pgbench-derail version
> exactly because of the earlier problems with snapbuild: I wanted a test that
> does a lot of starting and stopping of publication and subscription.

I think it is pretty unlikely that the logical replication plumbing is the buggy place. You're just seeing it now because we didn't have any mechanism as convenient to consume logical decoding output. In other words, I strongly suspect that the hypothetical bugs are in the logical decoding side (and snapbuild sounds the most promising candidate) rather than logical replication per se.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 26/05/17 16:51, Alvaro Herrera wrote:
> Erik Rijkers wrote:
>
>> I wouldn't say that problems (re)appeared at a certain point; my impression
>> is rather that logical replication has become better and better. But I kept
>> getting the odd failure, without a clear cause, but always (eventually)
>> repeatable on other machines. I did the 1-minute pgbench-derail version
>> exactly because of the earlier problems with snapbuild: I wanted a test that
>> does a lot of starting and stopping of publication and subscription.
>
> I think it is pretty unlikely that the logical replication plumbing is
> the buggy place. You're just seeing it now because we didn't have any
> mechanism as convenient to consume logical decoding output. In other
> words, I strongly suspect that the hypothetical bugs are in the logical
> decoding side (and snapbuild sounds the most promising candidate) rather
> than logical replication per se.

Well, that was true for the previous issues Erik found as well (mostly the snapshot builder was problematic). But that does not mean there are no issues elsewhere. We could do with some more output from the tests (do they log some intermediary states of those md5 checksums, maybe numbers of rows, etc?), description of the problems, errors from logs, etc. I for example don't get any issues from a similar test to the one described in this thread, so without more info it might be hard to reproduce and fix whatever the underlying issue is.

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, May 26, 2017 at 5:17 AM, tushar <tushar.ahuja@enterprisedb.com> wrote:
> run second time =
> ./pgbench -T 20 -c 90 -j 90 -f test.sql postgres
> check the row count on master/standby
> Master =
> postgres=# select count(*) from pgbench_history ;
>  count
> --------
>  536836
> (1 row)
> Standby =
> postgres=# select count(*) from pgbench_history ;
>   count
> ---------
>  1090959
> (1 row)
Hi Tushar,
pgbench starts out by truncating pgbench_history. That truncation does not get replicated to the subscriber. The later inserts do. So your subscriber ends up with about twice as many rows.
Cheers,
Jeff
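Since the TRUNCATE is not carried over by logical replication here, a test like the one above has to clear the history table on both sides itself. A sketch of one way to keep the counts comparable (assumed port variables; -n skips pgbench's implicit vacuum step, which is where the pgbench_history truncate Jeff describes happens; this is not tushar's original script):

-------------------
# assumed: $pub_port = publisher, $sub_port = subscriber
echo "truncate table pgbench_history;" | psql -qX -p $pub_port postgres
echo "truncate table pgbench_history;" | psql -qX -p $sub_port postgres
./pgbench -n -T 20 -c 90 -j 90 -f test.sql -p $pub_port postgres
-------------------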
On Fri, May 26, 2017 at 12:27 AM, Erik Rijkers <er@xs4all.nl> wrote:
> On 2017-05-26 08:58, Simon Riggs wrote:
>> On 26 May 2017 at 07:10, Erik Rijkers <er@xs4all.nl> wrote:
>>> - Do you agree this number of failures is far too high?
>>> - Am I the only one finding so many failures?
>>
>> What type of failure are you getting?
>
> The failure is that in the result state the replicated tables differ from the original tables.
But what is the actual failure? Which tables differ? Do they have the same number of rows? Do they only differ in the *balance column or something else? Are they transactionally consistent?
I have not been able to replicate the problem.
Cheers,
Jeff
On 26/05/17 20:09, Erik Rijkers wrote:
>
> The idea is simple enough:
>
>   startup instance1
>   startup instance2 (on same machine)
>   primary:  init pgbench tables
>   primary:  add primary key to pgbench_history
>   copy empty tables to replica by dump/restore
>   primary:  start publication
>   replica:  start subscription
>   primary:  run 1-minute pgbench
>   wait till the 4 md5's of primary pgbench tables
>      are the same as the 4 md5's of replica pgbench
>      tables (this will need a time-out).
>   log 'ok' or 'not ok'
>   primary:  clean up publication
>   replica:  clean up subscription
>   shutdown primary
>   shutdown replica
>
>   this whole thing 100x

I might have a look at scripting this up (especially if it keeps raining here)...

Some questions that might help me get it right:

- do you think we need to stop and start the instances every time?
- do we need to init pgbench each time?
- could we just drop the subscription and publication and truncate the replica tables instead?
- what scale pgbench are you running?
- how many clients for the 1 min pgbench run?
- are you starting the pgbench run while the copy_data jobs for the subscription are still running?
- how exactly are you calculating those md5's?

Cheers

Mark
On 2017-05-27 01:35, Mark Kirkwood wrote:
> On 26/05/17 20:09, Erik Rijkers wrote:
>>
>> this whole thing 100x
>
> Some questions that might help me get it right:
> - do you think we need to stop and start the instances every time?
> - do we need to init pgbench each time?
> - could we just drop the subscription and publication and truncate the
>   replica tables instead?

I have done all that in earlier versions. I deliberately added these 'complications' in view of the intractability of the problem: my fear is that an earlier failure leaves some half-failed state behind in an instance, which then might cause more failure. This would undermine the intent of the whole exercise (which is to count the success/failure rate). So it is important to be as sure as possible that each cycle starts out as cleanly as possible.

> - what scale pgbench are you running?

I use a small script to call the main script; at the moment it does something like:

-------------------
duration=60
from=1
to=100
for scale in 25 5
do
  for clients in 90 64 8
  do
    date_str=$(date +"%Y%m%d_%H%M")
    outfile=out_${date_str}.txt
    time for x in `seq $from $to`
    do
      ./pgbench_derail2.sh $scale $clients $duration $date_str
      [...]
-------------------

> - how many clients for the 1 min pgbench run?

see above

> - are you starting the pgbench run while the copy_data jobs for the
>   subscription are still running?

I assume with copy_data you mean the data sync of the original table before pgbench starts. And yes, I think here might be the origin of the problem. (I think the problem I get is actually easily avoided by putting wait states here and there in between separate steps. But the testing idea here is to force the system into error, not to avoid any errors.)

> - how exactly are you calculating those md5's?

Here is the bash function: cb (I forget what that stands for, I guess 'content bench').
$outf is a log file to which the program writes output:

---------------------------
function cb()
{
  # display the 4 pgbench tables' accumulated content as md5s
  # a,b,t,h stand for: pgbench_accounts, -branches, -tellers, -history
  num_tables=$( echo "select count(*) from pg_class
                      where relkind = 'r' and relname ~ '^pgbench_'" | psql -qtAX )
  if [[ $num_tables -ne 4 ]]
  then
    echo "pgbench tables not 4 - exit" >> $outf
    exit
  fi
  for port in $port1 $port2
  do
    md5_a=$(echo "select * from pgbench_accounts order by aid" | psql -qtAXp $port | md5sum | cut -b 1-9)
    md5_b=$(echo "select * from pgbench_branches order by bid" | psql -qtAXp $port | md5sum | cut -b 1-9)
    md5_t=$(echo "select * from pgbench_tellers  order by tid" | psql -qtAXp $port | md5sum | cut -b 1-9)
    md5_h=$(echo "select * from pgbench_history  order by hid" | psql -qtAXp $port | md5sum | cut -b 1-9)
    cnt_a=$(echo "select count(*) from pgbench_accounts" | psql -qtAXp $port)
    cnt_b=$(echo "select count(*) from pgbench_branches" | psql -qtAXp $port)
    cnt_t=$(echo "select count(*) from pgbench_tellers"  | psql -qtAXp $port)
    cnt_h=$(echo "select count(*) from pgbench_history"  | psql -qtAXp $port)
    md5_total[$port]=$( echo "${md5_a} ${md5_b} ${md5_t} ${md5_h}" | md5sum )
    printf "$port a,b,t,h: %8d %6d %6d %6d" $cnt_a $cnt_b $cnt_t $cnt_h
    echo -n " $md5_a $md5_b $md5_t $md5_h"
    if   [[ $port -eq $port1 ]]; then echo    " master"
    elif [[ $port -eq $port2 ]]; then echo -n " replica"
    else                              echo    " ERROR "
    fi
  done
  if [[ "${md5_total[$port1]}" == "${md5_total[$port2]}" ]]
  then echo " ok"
  else echo " NOK"
  fi
}
---------------------------

this enables:

  echo "-- getting md5 (cb)"
  cb_text1=$(cb)

and testing that string like:

  if echo "$cb_text1" | grep -qw 'replica ok'; then
     echo "-- All is well."
  [...]

Later today I'll try to clean up the whole thing and post it.
On 2017-05-27 01:35, Mark Kirkwood wrote:
> On 26/05/17 20:09, Erik Rijkers wrote:
>>
>> The idea is simple enough:
>>
>>   startup instance1
>>   startup instance2 (on same machine)
>>   primary:  init pgbench tables
>>   primary:  add primary key to pgbench_history
>>   copy empty tables to replica by dump/restore
>>   primary:  start publication
>>   replica:  start subscription
>>   primary:  run 1-minute pgbench
>>   wait till the 4 md5's of primary pgbench tables
>>      are the same as the 4 md5's of replica pgbench
>>      tables (this will need a time-out).
>>   log 'ok' or 'not ok'
>>   primary:  clean up publication
>>   replica:  clean up subscription
>>   shutdown primary
>>   shutdown replica
>>
>>   this whole thing 100x

Here is what I have:

instances.sh:
   starts up 2 assert enabled sessions

instances_fast.sh:
   alternative to instances.sh
   starts up 2 assert disabled 'fast' sessions

testset.sh
   loop to call pgbench_derail2.sh with varying params

pgbench_derail2.sh
   main test program
   can be called 'standalone':
      ./pgbench_derail2.sh $scale $clients $duration $date_str
   so for instance this should work:
      ./pgbench_derail2.sh 25 64 60 20170527_1019
   to remove publication and subscription from sessions, add a 5th parameter 'clean':
      ./pgbench_derail2.sh 1 1 1 1 'clean'

pubsub.sh
   displays replication state. also called by pgbench_derail2.sh
   must be in path

result.sh
   display results
   I keep this in a screen-session as:
      watch -n 20 './result.sh 201705'

Peculiar to my setup also:
   server version at compile time stamped with date + commit hash
   I misuse information_schema.sql_packages at compile time to store patch information
   instances are in $pg_stuff_dir/pg_installations/pgsql.<project name>

So you'll have to outcomment a line here and there, and adapt paths, ports, and things like that.

It's a bit messy, I should have used perl from the beginning...

Good luck :)

Erik Rijkers
On 2017-05-27 10:30, Erik Rijkers wrote:
> On 2017-05-27 01:35, Mark Kirkwood wrote:
>
> Here is what I have:
>
> instances.sh:
> testset.sh
> pgbench_derail2.sh
> pubsub.sh

To be clear (apart from that standalone call like ./pgbench_derail2.sh $scale $clients $duration $date_str):

I normally run by editing the parameters in testset.sh, then run:

  ./testset.sh

That then shows a tail -F of the output logfile (to paste into another screen).

In yet another screen runs the 'watch -n 20 ./result.sh' line.

The output files are the .txt files. The logfiles of the instances are (at the end of each test) copied to directory logfiles/ under a meaningful name that shows the parameters, and with an extension like '.ok.log' or '.NOK.log'.

I am very curious about your results.
On 27 May 2017 at 09:44, Erik Rijkers <er@xs4all.nl> wrote:

> I am very curious about your results.

We take your bug report on good faith, but we still haven't seen details of the problem or how to recreate it.

Please post some details. Thanks.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On May 27, 2017 6:13:19 AM EDT, Simon Riggs <simon@2ndquadrant.com> wrote:
>On 27 May 2017 at 09:44, Erik Rijkers <er@xs4all.nl> wrote:
>
>> I am very curious about your results.
>
>We take your bug report on good faith, but we still haven't seen
>details of the problem or how to recreate it.
>
>Please post some details. Thanks.

?

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On 2017-05-27 17:11, Andres Freund wrote:
> On May 27, 2017 6:13:19 AM EDT, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On 27 May 2017 at 09:44, Erik Rijkers <er@xs4all.nl> wrote:
>>
>>> I am very curious about your results.
>>
>> We take your bug report on good faith, but we still haven't seen
>> details of the problem or how to recreate it.
>>
>> Please post some details. Thanks.
>
> ?

ok, ok...

(The thing is, I am trying to pre-digest the output but it takes time.)

I can do this now: attached some output that belongs with this group of 100 1-minute runs:

-- out_20170525_1426.txt
    100 -- pgbench -c 64 -j 8 -T 60 -P 12 -n   --  scale 25
     82 -- All is well.
     18 -- Not good.

That is the worst set of runs of what I showed earlier.

That is: out_20170525_1426.txt, and 2x18 logfiles that the 18 failed runs produced. Those logfiles have names like:

  logrep.20170525_1426.1436.1.scale_25.clients_64.NOK.log
  logrep.20170525_1426.1436.2.scale_25.clients_64.NOK.log

  .1. = primary
  .2. = replica

Please disregard the errors around pg_current_wal_location(). (It was caused by some code to dump some wal into zipfiles, which obviously stopped working after the function was removed/renamed.) There are also some unimportant errors from the test-harness where I call with the wrong port. Not interesting, I don't think.
Interesting - might be good to see your test script too (so we can better understand how you are deciding if the runs are successful or not).

Also, any idea which rows are different? If you want something out of the box that will do that for you, see DBIx::Compare.

regards

Mark

On 28/05/17 04:12, Erik Rijkers wrote:
>
> ok, ok...
>
> (The thing is, I am trying to pre-digest the output but it takes time.)
>
> I can do this now: attached some output that belongs with this group
> of 100 1-minute runs:
>
> -- out_20170525_1426.txt
>     100 -- pgbench -c 64 -j 8 -T 60 -P 12 -n   --  scale 25
>      82 -- All is well.
>      18 -- Not good.
>
> That is the worst set of runs of what I showed earlier.
>
> that is: out_20170525_1426.txt and
> 2x18 logfiles that the 18 failed runs produced.
> Those logfiles have names like:
>   logrep.20170525_1426.1436.1.scale_25.clients_64.NOK.log
>   logrep.20170525_1426.1436.2.scale_25.clients_64.NOK.log
>
> .1. = primary
> .2. = replica
>
> Please disregard the errors around pg_current_wal_location(). (it was
> caused by some code to dump some wal into zipfiles which obviously
> stopped working after the function was removed/renamed) There are also
> some unimportant errors from the test-harness where I call with the
> wrong port. Not interesting, I don't think.
Sorry - I see you have done this already.

On 28/05/17 11:15, Mark Kirkwood wrote:
> Interesting - might be good to see your test script too (so we can
> better understand how you are deciding if the runs are successful or
> not).
On 2017-05-28 01:21, Mark Kirkwood wrote:
> Sorry - I see you have done this already.
>
> On 28/05/17 11:15, Mark Kirkwood wrote:
>> Interesting - might be good to see your test script too (so we can
>> better understand how you are deciding if the runs are successful or
>> not).

Yes, in pgbench_derail2.sh, in the cb function, it says:

  if [[ "${md5_total[$port1]}" == "${md5_total[$port2]}" ]]
  then echo " ok"
  else echo " NOK"
  fi

This is the final decision about success ('ok') or failure ('NOK'). (NOK stands for 'Not OK'.)

The two compared md5's (on the two ports: primary and replica) are each taken over a concatenation of the 4 separate md5's of the table content (taken earlier in cb()). If one or more of the 4 md5's differs, then that concatenation-md5 will differ too.

Sorry, there is not a lot of comment in the scripts.
On 2017-05-28 01:15, Mark Kirkwood wrote:
> Also, any idea which rows are different? If you want something out of
> the box that will do that for you, see DBIx::Compare.

I used to save the content-diffs too but in the end decided they were useless (to me, anyway).
On 27/05/17 20:30, Erik Rijkers wrote:
>
> Here is what I have:
>
> instances.sh:
>    starts up 2 assert enabled sessions
>
> instances_fast.sh:
>    alternative to instances.sh
>    starts up 2 assert disabled 'fast' sessions
>
> testset.sh
>    loop to call pgbench_derail2.sh with varying params
>
> pgbench_derail2.sh
>    main test program
>    can be called 'standalone':
>       ./pgbench_derail2.sh $scale $clients $duration $date_str
>    so for instance this should work:
>       ./pgbench_derail2.sh 25 64 60 20170527_1019
>    to remove publication and subscription from sessions, add a 5th parameter 'clean':
>       ./pgbench_derail2.sh 1 1 1 1 'clean'
>
> pubsub.sh
>    displays replication state. also called by pgbench_derail2.sh
>    must be in path
>
> result.sh
>    display results
>    I keep this in a screen-session as:
>       watch -n 20 './result.sh 201705'
>
> Peculiar to my setup also:
>    server version at compile time stamped with date + commit hash
>    I misuse information_schema.sql_packages at compile time to store patch information
>    instances are in $pg_stuff_dir/pg_installations/pgsql.<project name>
>
> So you'll have to outcomment a line here and there, and adapt paths,
> ports, and things like that.
>
> It's a bit messy, I should have used perl from the beginning...

Considering it is all shell - pretty nice!

I spent a bit of time today getting this working on a vanilla Ubuntu 16.04 cloud server. I found a few things that didn't work (I suspect Erik has some default env variables set for ports and databases). These were sufficient to stop logical replication working for me at all - due to no dbname specified in the subscription connection. Given I had to make some changes anyway, I moved all the config into one place (new file config.sh), made all the programs use /bin/bash as interpreter (/bin/sh just does not work for these scripts on Ubuntu), added ports and databases as reqd, and fixed the need to mess too much with PATH (see attached diff).

So running in cloud land now... so far no errors - will update.

regards

Mark
On 2017-05-26 15:59, Petr Jelinek wrote:
> Hmm, I was under the impression that the changes we proposed in the
> snapbuild thread fixed your issues, does this mean they didn't? Or the
> modified versions of those that were eventually committed didn't? Or did
> issues reappear at some point?

Here is a bit of info: just now (using Mark Kirkwood's version of my test) I had a session logging this:

  unknown relation state "w"

which I had never seen before. This is column srsubstate in pg_subscription_rel. That session completed successfully ('replica ok'), so it's not necessarily a problem.

Grepping through my earlier logs (of weeks of intermittent test runs), I found only one more occurrence (timestamp 20170525_0125). There it occurred in a failed session. No idea what it means.

At the very least this value 'w' is missing from the documentation, which only mentions:

  i = initialize
  d = data copy
  s = synchronized
  r = (normal replication)

Erik Rijkers
On 28/05/17 19:01, Mark Kirkwood wrote:
>
> So running in cloud land now... so far no errors - will update.

The framework ran 600 tests last night, and I see 3 'NOK' results, i.e. 3 failed test runs (all scale 25 and 8 pgbench clients). Given the way the test decides on failure (gets tired of waiting for the table md5's to match), it begs the question 'What if it had waited a bit longer'? However from what I can see, in all cases:

- the rowcounts were the same in master and replica
- the md5 of pgbench_accounts was different

...so does seem possible that there is some bug being tickled here. Unfortunately the test framework blasts away the failed tables and subscription and continues on... I'm going to amend it to stop on failure so I can have a closer look at what happened.

regards

Mark
On Sun, May 28, 2017 at 3:17 PM, Mark Kirkwood <mark.kirkwood@catalyst.net.nz> wrote:
> On 28/05/17 19:01, Mark Kirkwood wrote:
>> So running in cloud land now... so far no errors - will update.
>
> The framework ran 600 tests last night, and I see 3 'NOK' results, i.e. 3 failed test runs (all scale 25 and 8 pgbench clients). Given the way the test decides on failure (gets tired of waiting for the table md5's to match), it begs the question 'What if it had waited a bit longer'? However from what I can see, in all cases:
>
> - the rowcounts were the same in master and replica
> - the md5 of pgbench_accounts was different
All four tables should be wrong if there is still a transaction it is waiting for, as all the changes happen in a single transaction.
I also got a failure, after 87 iterations of a similar test case. It waited for hours, as mine requires manual intervention to stop waiting. On the subscriber, one account still had a zero balance, while the history table on the subscriber agreed with both history and accounts on the publisher and the account should not have been zero, so definitely a transaction atomicity got busted.
I altered the script to also save the tellers and branches tables and repeated the runs, but so far it hasn't failed again in over 800 iterations using the altered script.
> ...so does seem possible that there is some bug being tickled here. Unfortunately the test framework blasts away the failed tables and subscription and continues on... I'm going to amend it to stop on failure so I can have a closer look at what happened.
What would you want to look at? Would saving the WAL from the master be helpful?
Cheers,
Jeff
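The atomicity check Jeff describes (accounts agreeing with history) can be expressed as a query. A sketch, assuming a fresh pgbench -i (so balances started at zero) and the built-in TPC-B-like script, run against both nodes; any row returned is an account whose balance does not match the sum of its history deltas:

-------------------
# assumed: $port = node to check (run against both primary and replica)
echo "select a.aid, a.abalance, coalesce(sum(h.delta), 0) as hist_sum
      from pgbench_accounts a
      left join pgbench_history h using (aid)
      group by a.aid, a.abalance
      having a.abalance <> coalesce(sum(h.delta), 0);" | psql -qtAXp $port
-------------------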
On 2017-05-29 00:17, Mark Kirkwood wrote:
> On 28/05/17 19:01, Mark Kirkwood wrote:
>
>> So running in cloud land now... so far no errors - will update.
>
> The framework ran 600 tests last night, and I see 3 'NOK' results, i.e.
> 3 failed test runs (all scale 25 and 8 pgbench clients). Given the way

Could you also give the params for the successful runs?

Can you say anything about hardware? (My experience is that older, slower, 'worse' hardware makes for more fails.)

Many thanks, by the way. I'm glad that it turns out I'm probably not doing something uniquely stupid (although I'm not glad that there seems to be a bug, and an elusive one at that).

Erik Rijkers
On 2017-05-29 03:33, Jeff Janes wrote:
> On Sun, May 28, 2017 at 3:17 PM, Mark Kirkwood <
> mark.kirkwood@catalyst.net.nz> wrote:
>
> I also got a failure, after 87 iterations of a similar test case.  It
[...]
> repeated the runs, but so far it hasn't failed again in over 800
> iterations

Could you give the params for the successful runs? (Ideally, a grep | sort | uniq -c of the pgbench lines that were run.)

Can you say anything about hardware?

Thanks for repeating my lengthy tests.

Erik Rijkers
On 29/05/17 16:26, Erik Rijkers wrote:
> On 2017-05-29 00:17, Mark Kirkwood wrote:
>> On 28/05/17 19:01, Mark Kirkwood wrote:
>>
>>> So running in cloud land now... so far no errors - will update.
>>
>> The framework ran 600 tests last night, and I see 3 'NOK' results, i.e.
>> 3 failed test runs (all scale 25 and 8 pgbench clients). Given the way
>
> Could you also give the params for the successful runs?

Scale 25, clients 90 and 64; scale 5, clients 90, 64 and 8 (i.e. the defaults in the driver script).

> Can you say anything about hardware? (My experience is that older,
> slower, 'worse' hardware makes for more fails.)

It's running in an openstack cloud (so is a libvirt guest): 4 cpus, 4 GB ram and 2 disks, one for each Postgres instance, both formatted xfs. Hmm, so maybe I should run a VM on my workstation and crank the IOPS limit way down... in the meantime I'll just let it run :-)

> Many thanks, by the way. I'm glad that it turns out I'm probably not
> doing something uniquely stupid (although I'm not glad that there
> seems to be a bug, and an elusive one at that)

Yeah, looks like something subtle :-( Hopefully now it's out in the open we'll all figure it out together!

regards

Mark
On 29/05/17 13:33, Jeff Janes wrote:
> On Sun, May 28, 2017 at 3:17 PM, Mark Kirkwood
> <mark.kirkwood@catalyst.net.nz> wrote:
>
>> On 28/05/17 19:01, Mark Kirkwood wrote:
>>
>>> So running in cloud land now... so far no errors - will update.
>>
>> The framework ran 600 tests last night, and I see 3 'NOK' results,
>> i.e. 3 failed test runs (all scale 25 and 8 pgbench clients). Given
>> the way the test decides on failure (gets tired of waiting for the
>> table md5's to match) - it begs the question 'What if it had
>> waited a bit longer'? However from what I can see in all cases:
>>
>> - the rowcounts were the same in master and replica
>> - the md5 of pgbench_accounts was different
>
> All four tables should be wrong if there is still a transaction it is
> waiting for, as all the changes happen in a single transaction.

Yeah, my thought exactly.

> I also got a failure, after 87 iterations of a similar test case.  It
> waited for hours, as mine requires manual intervention to stop
> waiting.  On the subscriber, one account still had a zero balance,
> while the history table on the subscriber agreed with both history and
> accounts on the publisher and the account should not have been zero,
> so definitely a transaction atomicity got busted.

Interesting, great that we are seeing the same thing.

> I altered the script to also save the tellers and branches tables and
> repeated the runs, but so far it hasn't failed again in over 800
> iterations using the altered script.
>
>> ...so does seem possible that there is some bug being tickled
>> here. Unfortunately the test framework blasts away the failed
>> tables and subscription and continues on... I'm going to amend it
>> to stop on failure so I can have a closer look at what happened.
>
> What would you want to look at?  Would saving the WAL from the master
> be helpful?

Good question - I initially wanted to see if anything changed if I waited longer (which you have already figured out) and what was actually wrong with the accounts record (which you have gotten to as well)! Nice.

After that, I'd thought of doing another transaction on an accounts record that lives in the same page as the 'gammy' one on the master - generally poke about and see what is happening :-)

regards

Mark
On 29/05/17 03:33, Jeff Janes wrote:
> On Sun, May 28, 2017 at 3:17 PM, Mark Kirkwood
> <mark.kirkwood@catalyst.net.nz> wrote:
>
>> The framework ran 600 tests last night, and I see 3 'NOK' results,
>> i.e. 3 failed test runs (all scale 25 and 8 pgbench clients). Given
>> the way the test decides on failure (gets tired of waiting for the
>> table md5's to match) - it begs the question 'What if it had waited
>> a bit longer'? However from what I can see in all cases:
>>
>> - the rowcounts were the same in master and replica
>> - the md5 of pgbench_accounts was different
>
> All four tables should be wrong if there is still a transaction it is
> waiting for, as all the changes happen in a single transaction.

Not necessarily. If the bug is in the sync worker or in the sync-to-apply handover code (which is one of the more complicated parts of all of the logical replication, so it's a prime candidate), then it can easily be just one table.

> I also got a failure, after 87 iterations of a similar test case.  It
> waited for hours, as mine requires manual intervention to stop waiting.
> On the subscriber, one account still had a zero balance, while the
> history table on the subscriber agreed with both history and accounts on
> the publisher and the account should not have been zero, so definitely a
> transaction atomicity got busted.

I am glad others are able to reproduce this, my machine is still at 0 failures after 800 cycles.

> What would you want to look at?  Would saving the WAL from the master be
> helpful?

Useful info is: logs from the provider (mainly the logical decoding logs that mention LSNs), logs from the subscriber (the lines about when sync workers finished), contents of pg_subscription_rel (with srrelid casted to regclass so we know which table is which), and pg_waldump output around the LSNs found in the logs and in pg_subscription_rel (plus a few lines before and some after to get context). It's enough to only care about LSNs for the table(s) that are out of sync.

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
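The pg_subscription_rel part of that request can be gathered with a query like the sketch below (run on the subscriber; the port variable is assumed). The reported srsublsn values are what would then be looked up with pg_waldump on the publisher's WAL:

-------------------
# assumed: $sub_port = subscriber instance
echo "select srrelid::regclass as rel, srsubstate, srsublsn
      from pg_subscription_rel
      order by 1;" | psql -qtAXp $sub_port
-------------------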
On 29/05/17 23:14, Petr Jelinek wrote:

> On 29/05/17 03:33, Jeff Janes wrote:
>
>> What would you want to look at?  Would saving the WAL from the master be
>> helpful?
>
> Useful info is: logs from the provider (mainly the logical decoding logs
> that mention LSNs), logs from the subscriber (the lines about when sync
> workers finished), contents of pg_subscription_rel (with srrelid
> casted to regclass so we know which table is which), and pg_waldump
> output around the LSNs found in the logs and in pg_subscription_rel
> (plus a few lines before and some after to get context). It's enough to
> only care about LSNs for the table(s) that are out of sync.

I have a run that aborted with failure (accounts table md5 mismatch). Petr - would you like to have access to the machine? If so, send me your public key and I'll set it up.

Cheers

Mark
On 2017-05-26 08:10, Erik Rijkers wrote:
> If you run a pgbench session of 1 minute over a logical replication
> connection and repeat that 100x this is what you get:
>
> At clients 90, 64, 8, scale 25:
> -- out_20170525_0944.txt
>     100 -- pgbench -c 90 -j 8 -T 60 -P 12 -n   --  scale 25
>       7 -- Not good.
> -- out_20170525_1426.txt
>     100 -- pgbench -c 64 -j 8 -T 60 -P 12 -n   --  scale 25
>      18 -- Not good.
> -- out_20170525_2049.txt
>     100 -- pgbench -c 8 -j 8 -T 60 -P 12 -n   --  scale 25
>      10 -- Not good.
>
> At clients 90, 64, 8, scale 5:
> -- out_20170526_0126.txt
>     100 -- pgbench -c 90 -j 8 -T 60 -P 12 -n   --  scale 5
>       2 -- Not good.
> -- out_20170526_0352.txt
>     100 -- pgbench -c 64 -j 8 -T 60 -P 12 -n   --  scale 5
>       3 -- Not good.
> -- out_20170526_0621.txt
>     100 -- pgbench -c 8 -j 8 -T 60 -P 12 -n   --  scale 5
>       4 -- Not good.

It seems this problem is a bit less serious than it did look to me (as others find lower numbers of fails). Still, how is its seriousness graded by now? Is it a show-stopper? Should it go onto the Open Items page? Is anyone still looking into it?

thanks,

Erik Rijkers

> The above installations (master+replica) are with Petr Jelinek's (and
> Michael Paquier's) last patches
>   0001-Fix-signal-handling-in-logical-workers.patch
>   0002-Make-tablesync-worker-exit-when-apply-dies-while-it-.patch
>   0003-Receive-invalidation-messages-correctly-in-tablesync.patch
>   Remove-the-SKIP-REFRESH-syntax-suggar-in-ALTER-SUBSC-v2.patch
On 29/05/17 23:06, Mark Kirkwood wrote:
> On 29/05/17 23:14, Petr Jelinek wrote:
>
>> On 29/05/17 03:33, Jeff Janes wrote:
>>
>>> What would you want to look at?  Would saving the WAL from the master be
>>> helpful?
>>
>> Useful info is: logs from the provider (mainly the logical decoding logs
>> that mention LSNs), logs from the subscriber (the lines about when sync
>> workers finished), contents of pg_subscription_rel (with srrelid
>> casted to regclass so we know which table is which), and pg_waldump
>> output around the LSNs found in the logs and in pg_subscription_rel
>> (plus a few lines before and some after to get context). It's enough to
>> only care about LSNs for the table(s) that are out of sync.
>
> I have a run that aborted with failure (accounts table md5 mismatch).
> Petr - would you like to have access to the machine? If so, send me your
> public key and I'll set it up.

Thanks to Mark's offer I was able to study the issue as it happened and found the cause of this.

The busy loop in apply stops at the point when the worker shmem state indicates that table synchronization was finished, but that might not be visible in the next transaction if it takes long to flush the final commit to disk, so we might ignore a couple of transactions for a given table in the main apply because we think it's still being synchronized. This also explains why I could not reproduce it on my testing machine (fast ssd disk array where flushes were always fast) and why it happens relatively rarely: it's one specific commit during the whole synchronization process that needs to be slow.

So as a solution I changed the busy loop in the apply to wait for in-catalog status rather than in-memory status, to make sure things are really there and flushed.

While working on this I realized that the handover itself is a bit more complex than necessary (especially for debugging and for other people understanding it), so I did some small changes as part of this patch to make the sequence of states a table goes through during the synchronization process always be the same. This might cause an unnecessary update per table synchronization in some cases, but that seems like a small enough price to pay for clearer logic. And it also fixes another potential bug that I identified, where we might write the wrong state to the catalog if the main apply crashed while the sync worker was waiting for a status update.

I've been running tests on this overnight on another machine where I was able to reproduce the original issue within a few runs (once I found what causes it) and so far it looks good.

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
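To watch that handover from the catalog's point of view (rather than the workers' shared-memory state) while a test runs, something along these lines can be polled on the subscriber. It only reports the documented srsubstate values; a diagnostic sketch with an assumed port variable, not part of the patch:

-------------------
# assumed: $sub_port = subscriber instance; stop with Ctrl-C
while :
do
  echo "select now()::time(0), srrelid::regclass as rel, srsubstate
        from pg_subscription_rel
        order by 2;" | psql -qtAXp $sub_port
  # all rows at 'r' (ready) means initial sync and handover have completed
  sleep 1
done
-------------------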
On 2017-05-31 11:16, Petr Jelinek wrote:
[...]
> Thanks to Mark's offer I was able to study the issue as it happened and
> found the cause of this.
>
> [0001-Improve-handover-logic-between-sync-and-apply-worker.patch]

This looks good:

-- out_20170531_1141.txt
    100 -- pgbench -c 90 -j 8 -T 60 -P 12 -n   --  scale 25
    100 -- All is well.

So this is 100x a 1-minute test with 100x success. (This on the most fastidious machine (slow disks, meagre specs) that used to give 15% failures.)

I'll let it run for a couple of days with varying params (and on varying hardware) but it definitely does look as if you fixed it.

Thanks!

Erik Rijkers
On 5/31/17 05:16, Petr Jelinek wrote:
> I've been running tests on this overnight on another machine where I was
> able to reproduce the original issue within few runs (once I found what
> causes it) and so far looks good.

I'll give people another day or so to test this before committing.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 31/05/17 21:16, Petr Jelinek wrote:
> On 29/05/17 23:06, Mark Kirkwood wrote:
>> On 29/05/17 23:14, Petr Jelinek wrote:
>>
>>> On 29/05/17 03:33, Jeff Janes wrote:
>>>
>>>> What would you want to look at? Would saving the WAL from the master be
>>>> helpful?
>>>
>>> Useful info is, logs from provider (mainly the logical decoding logs
>>> that mention LSNs), logs from subscriber (the lines about when sync
>>> workers finished), contents of the pg_subscription_rel (with srrelid
>>> casted to regclass so we know which table is which), and pg_waldump
>>> output around the LSNs found in the logs and in the pg_subscription_rel
>>> (+ few lines before and some after to get context). It's enough to only
>>> care about LSNs for the table(s) that are out of sync.
>>
>> I have a run that aborted with failure (accounts table md5 mismatch).
>> Petr - would you like to have access to the machine? If so send me your
>> public key and I'll set it up.
>
> Thanks to Mark's offer I was able to study the issue as it happened and
> found the cause of this.
>
> The busy loop in apply stops at the point when worker shmem state
> indicates that table synchronization was finished, but that might not be
> visible in the next transaction if it takes long to flush the final
> commit to disk so we might ignore couple of transactions for given table
> in the main apply because we think it's still being synchronized. This
> also explains why I could not reproduce it on my testing machine (fast
> ssd disk array where flushes were always fast) and why it happens
> relatively rarely because it's one specific commit during the whole
> synchronization process that needs to be slow.
>
> So as solution I changed the busy loop in the apply to wait for
> in-catalog status rather than in-memory status to make sure things are
> really there and flushed.
>
> While working on this I realized that the handover itself is bit more
> complex than necessary (especially for debugging and for other people
> understanding it) so I did some small changes as part of this patch to
> make the sequences of states table goes during synchronization process
> to always be the same. This might cause unnecessary update per one table
> synchronization process in some cases but that seems like small enough
> price to pay for clearer logic. And it also fixes another potential bug
> that I identified where we might write wrong state to catalog if main
> apply crashed while sync worker was waiting for status update.
>
> I've been running tests on this overnight on another machine where I was
> able to reproduce the original issue within few runs (once I found what
> causes it) and so far looks good.

I'm seeing a new failure with the patch applied - this time the history table has missing rows. Petr, I'll put back your access :-)

regards

Mark
On 2017-06-02 00:46, Mark Kirkwood wrote:
> On 31/05/17 21:16, Petr Jelinek wrote:
>
> I'm seeing a new failure with the patch applied - this time the
> history table has missing rows. Petr, I'll put back your access :-)

Is this error during 1-minute runs?

I'm asking because I've moved back to longer (1-hour) runs (no errors so far), and I'd like to keep track of what the most 'vulnerable' parameters are.

thanks,

Erik Rijkers
On 02/06/17 17:11, Erik Rijkers wrote:
> On 2017-06-02 00:46, Mark Kirkwood wrote:
>> On 31/05/17 21:16, Petr Jelinek wrote:
>>
>> I'm seeing a new failure with the patch applied - this time the
>> history table has missing rows. Petr, I'll put back your access :-)
>
> Is this error during 1-minute runs?
>
> I'm asking because I've moved back to longer (1-hour) runs (no errors
> so far), and I'd like to keep track of what the most 'vulnerable'
> parameters are.

Yeah, still using your test config (with my minor modifications).

When I got the error the 1st time, I did a complete make clean and rebuild... but it is still possible I've 'done it wrong' - so independent confirmation would be good!

Cheers

Mark
On 02/06/17 08:55, Mark Kirkwood wrote:
> On 02/06/17 17:11, Erik Rijkers wrote:
>
>> On 2017-06-02 00:46, Mark Kirkwood wrote:
>>> On 31/05/17 21:16, Petr Jelinek wrote:
>>>
>>> I'm seeing a new failure with the patch applied - this time the
>>> history table has missing rows. Petr, I'll put back your access :-)
>>
>> Is this error during 1-minute runs?
>>
>> I'm asking because I've moved back to longer (1-hour) runs (no errors
>> so far), and I'd like to keep track of what the most 'vulnerable'
>> parameters are.
>
> Yeah, still using your test config (with my minor modifications).
>
> When I got the error the 1st time, I did a complete make clean and
> rebuild... but it is still possible I've 'done it wrong' - so
> independent confirmation would be good!

Well, I've seen this issue as well while I was developing the fix, but the patch I proposed fixed it for me, as well as the original issue.

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 02/06/17 22:29, Petr Jelinek wrote:
> On 02/06/17 08:55, Mark Kirkwood wrote:
>> On 02/06/17 17:11, Erik Rijkers wrote:
>>
>>> On 2017-06-02 00:46, Mark Kirkwood wrote:
>>>> On 31/05/17 21:16, Petr Jelinek wrote:
>>>>
>>>> I'm seeing a new failure with the patch applied - this time the
>>>> history table has missing rows. Petr, I'll put back your access :-)
>>>
>>> Is this error during 1-minute runs?
>>>
>>> I'm asking because I've moved back to longer (1-hour) runs (no errors
>>> so far), and I'd like to keep track of what the most 'vulnerable'
>>> parameters are.
>>
>> Yeah, still using your test config (with my minor modifications).
>>
>> When I got the error the 1st time, I did a complete make clean and
>> rebuild... but it is still possible I've 'done it wrong' - so
>> independent confirmation would be good!
>
> Well, I've seen this issue as well while I was developing the fix, but
> the patch I proposed fixed it for me as well as the original issue.

While I was testing something for a different thread I noticed that I manage transactions incorrectly in this patch. Fixed here; I didn't test it much yet (it takes a while as you know :) ). Not sure if it's related to the issue you've seen though.

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 03/06/17 11:10, Petr Jelinek wrote:
> On 02/06/17 22:29, Petr Jelinek wrote:
>> On 02/06/17 08:55, Mark Kirkwood wrote:
>>> On 02/06/17 17:11, Erik Rijkers wrote:
>>>
>>>> On 2017-06-02 00:46, Mark Kirkwood wrote:
>>>>> On 31/05/17 21:16, Petr Jelinek wrote:
>>>>>
>>>>> I'm seeing a new failure with the patch applied - this time the
>>>>> history table has missing rows. Petr, I'll put back your access :-)
>>>>
>>>> Is this error during 1-minute runs?
>>>>
>>>> I'm asking because I've moved back to longer (1-hour) runs (no errors
>>>> so far), and I'd like to keep track of what the most 'vulnerable'
>>>> parameters are.
>>>
>>> Yeah, still using your test config (with my minor modifications).
>>>
>>> When I got the error the 1st time, I did a complete make clean and
>>> rebuild... but it is still possible I've 'done it wrong' - so
>>> independent confirmation would be good!
>>
>> Well, I've seen this issue as well while I was developing the fix, but
>> the patch I proposed fixed it for me as well as the original issue.
>
> While I was testing something for different thread I noticed that I
> manage transactions incorrectly in this patch. Fixed here, I didn't test
> it much yet (it takes a while as you know :) ). Not sure if it's related
> to the issue you've seen though.

Ok - I've applied this version, and running tests again. I needed to do a git pull to apply the patch, so getting some other changes too!

Cheers

Mark
On 03/06/17 04:45, Mark Kirkwood wrote:
> On 03/06/17 11:10, Petr Jelinek wrote:
>
>> On 02/06/17 22:29, Petr Jelinek wrote:
>>> On 02/06/17 08:55, Mark Kirkwood wrote:
>>>> On 02/06/17 17:11, Erik Rijkers wrote:
>>>>
>>>>> On 2017-06-02 00:46, Mark Kirkwood wrote:
>>>>>> On 31/05/17 21:16, Petr Jelinek wrote:
>>>>>>
>>>>>> I'm seeing a new failure with the patch applied - this time the
>>>>>> history table has missing rows. Petr, I'll put back your access :-)
>>>>>
>>>>> Is this error during 1-minute runs?
>>>>>
>>>>> I'm asking because I've moved back to longer (1-hour) runs (no errors
>>>>> so far), and I'd like to keep track of what the most 'vulnerable'
>>>>> parameters are.
>>>>
>>>> Yeah, still using your test config (with my minor modifications).
>>>>
>>>> When I got the error the 1st time, I did a complete make clean and
>>>> rebuild... but it is still possible I've 'done it wrong' - so
>>>> independent confirmation would be good!
>>>
>>> Well, I've seen this issue as well while I was developing the fix, but
>>> the patch I proposed fixed it for me as well as the original issue.
>>
>> While I was testing something for different thread I noticed that I
>> manage transactions incorrectly in this patch. Fixed here, I didn't test
>> it much yet (it takes a while as you know :) ). Not sure if it's related
>> to the issue you've seen though.
>
> Ok - I've applied this version, and running tests again. I needed to do
> a git pull to apply the patch, so getting some other changes too!

Thanks, yes, I forgot to mention that I rebased it against the current HEAD as well.

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Jun 2, 2017 at 4:10 PM, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote:
> While I was testing something for different thread I noticed that I
> manage transactions incorrectly in this patch. Fixed here, I didn't test
> it much yet (it takes a while as you know :) ). Not sure if it's related
> to the issue you've seen though.
This conflicts again with Peter E's recent commit 3c9bc2157a4f465b3c070d1250597568d2dc285f
Thanks,
Jeff
On 2017-05-31 16:20, Erik Rijkers wrote: > On 2017-05-31 11:16, Petr Jelinek wrote: > [...] >> Thanks to Mark's offer I was able to study the issue as it happened >> and >> found the cause of this. >> >> [0001-Improve-handover-logic-between-sync-and-apply-worker.patch] > > This looks good: > > -- out_20170531_1141.txt > 100 -- pgbench -c 90 -j 8 -T 60 -P 12 -n -- scale 25 > 100 -- All is well. > > So this is 100x a 1-minute test with 100x success. (This on the most > fastidious machine (slow disks, meagre specs) that used to give 15% > failures) [Improve-handover-logic-between-sync-and-apply-worker-v2.patch] No errors after (several days of) running variants of this. (2500x 1 minute runs; 12x 1-hour runs) > > Thanks! > > Erik Rijkers
On 05/06/17 00:04, Erik Rijkers wrote: > On 2017-05-31 16:20, Erik Rijkers wrote: >> On 2017-05-31 11:16, Petr Jelinek wrote: >> [...] >>> Thanks to Mark's offer I was able to study the issue as it happened and >>> found the cause of this. >>> >>> [0001-Improve-handover-logic-between-sync-and-apply-worker.patch] >> >> This looks good: >> >> -- out_20170531_1141.txt >> 100 -- pgbench -c 90 -j 8 -T 60 -P 12 -n -- scale 25 >> 100 -- All is well. >> >> So this is 100x a 1-minute test with 100x success. (This on the most >> fastidious machine (slow disks, meagre specs) that used to give 15% >> failures) > > [Improve-handover-logic-between-sync-and-apply-worker-v2.patch] > > No errors after (several days of) running variants of this. (2500x 1 > minute runs; 12x 1-hour runs) Same here, no errors with the v2 patch applied (approx 2 days - all 1 minute runs) regards Mark
On 05/06/17 13:08, Mark Kirkwood wrote: > On 05/06/17 00:04, Erik Rijkers wrote: > >> On 2017-05-31 16:20, Erik Rijkers wrote: >>> On 2017-05-31 11:16, Petr Jelinek wrote: >>> [...] >>>> Thanks to Mark's offer I was able to study the issue as it happened >>>> and >>>> found the cause of this. >>>> >>>> [0001-Improve-handover-logic-between-sync-and-apply-worker.patch] >>> >>> This looks good: >>> >>> -- out_20170531_1141.txt >>> 100 -- pgbench -c 90 -j 8 -T 60 -P 12 -n -- scale 25 >>> 100 -- All is well. >>> >>> So this is 100x a 1-minute test with 100x success. (This on the most >>> fastidious machine (slow disks, meagre specs) that used to give 15% >>> failures) >> >> [Improve-handover-logic-between-sync-and-apply-worker-v2.patch] >> >> No errors after (several days of) running variants of this. (2500x 1 >> minute runs; 12x 1-hour runs) > > Same here, no errors with the v2 patch applied (approx 2 days - all 1 > minute runs) > Further, reapplying the v1 patch (with a bit of editing as I wanted to apply it to my current master), gets a failure with missing rows in the history table quite quickly. I'll put back the v2 patch and resume runs with that, but I'm cautiously optimistic that the v2 patch solves the issue. regards Mark
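(A note for readers reproducing this: below is a minimal sketch of the kind of per-table check that exposes such a failure, assuming the standard pgbench_history table. Run the same query on the publisher and on the subscriber once the apply worker has caught up and compare the two outputs; differing counts or checksums mean rows are missing (or duplicated) on the replica. This is only an illustration, not the exact harness used in this thread.)

  -- run on both nodes and diff the results
  SELECT count(*)                                       AS history_rows,
         md5(string_agg(h::text, ',' ORDER BY h::text)) AS history_checksum
    FROM pgbench_history AS h;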
On 03/06/17 16:12, Jeff Janes wrote: > > On Fri, Jun 2, 2017 at 4:10 PM, Petr Jelinek > <petr.jelinek@2ndquadrant.com> wrote: > > > While I was testing something for different thread I noticed that I > manage transactions incorrectly in this patch. Fixed here, I didn't test > it much yet (it takes a while as you know :) ). Not sure if it's related > to the issue you've seen though. > > > This conflicts again with Peter E's recent commit > 3c9bc2157a4f465b3c070d1250597568d2dc285f > Rebased on top of current HEAD. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 6/4/17 22:38, Petr Jelinek wrote: > On 03/06/17 16:12, Jeff Janes wrote: >> >> On Fri, Jun 2, 2017 at 4:10 PM, Petr Jelinek >> <petr.jelinek@2ndquadrant.com> wrote: >> >> >> While I was testing something for different thread I noticed that I >> manage transactions incorrectly in this patch. Fixed here, I didn't test >> it much yet (it takes a while as you know :) ). Not sure if it's related >> to the issue you've seen though. >> >> >> This conflicts again with Peter E's recent commit >> 3c9bc2157a4f465b3c070d1250597568d2dc285f >> > > Rebased on top of current HEAD. Committed that, with some further updates of comments to reflect the changes. I do like the simplification of the state progression. Perhaps it could be simplified even further, by eliminating the SYNCDONE setting in LogicalRepSyncTableStart() and making it go through the apply loop and end up in process_syncing_tables_for_sync() in all cases. Which is kind of what the comments at the top of the file suggest would happen anyway. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2017-06-06 20:53, Peter Eisentraut wrote: > On 6/4/17 22:38, Petr Jelinek wrote: > > Committed that, with some further updates of comments to reflect the Belated apologies all round for the somewhat provocative $subject; but I felt at that moment that this item needed some extra attention. I don't know if it worked but I'm glad that it is solved ;) Thanks, Erik Rijkers
On Tue, Jun 6, 2017 at 3:01 PM, Erik Rijkers <er@xs4all.nl> wrote: > Belated apologies all round for the somewhat provocative $subject; but I > felt at that moment that this item needed some extra attention. FWIW, it seemed like a pretty fair subject line to me given your test results. I think it's in everyone's interest to get this feature stabilized before we ship a final release - people will rely on it, and if it drops even one row even occasionally, it's going to be a big problem for our users. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 06/06/17 21:09, Robert Haas wrote: > On Tue, Jun 6, 2017 at 3:01 PM, Erik Rijkers <er@xs4all.nl> wrote: >> Belated apologies all round for the somewhat provocative $subject; but I >> felt at that moment that this item needed some extra attention. > > FWIW, it seemed like a pretty fair subject line to me given your test > results. I think it's in everyone's interest to get this feature > stabilized before we ship a final release - people will rely on it, > and if it drops even one row even occasionally, it's going to be a big > problem for our users. > +1, I considered the issues solved when the snapshot builder issues were fixed, so it was a good reminder to check again. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2017-05-28 12:44, Erik Rijkers wrote: re: srsubstate in pg_subscription_rel: > No idea what it means. At the very least this value 'w' is missing > from the documentation, which only mentions: > i = initialize > d = data copy > s = synchronized > r = (normal replication) Shouldn't we add this to that table (51.53) in the documentation? After all, the value 'w' does show up when you monitor pg_subscription_rel.
On 6/17/17 06:48, Erik Rijkers wrote: > On 2017-05-28 12:44, Erik Rijkers wrote: > > re: srsubstate in pg_subscription_rel: > >> No idea what it means. At the very least this value 'w' is missing >> from the documentation, which only mentions: >> i = initialize >> d = data copy >> s = synchronized >> r = (normal replication) > > Shouldn't we add this to that table (51.53) in the documentation? > > After all, the value 'w' does show up when you monitor > pg_subscription_rel. It's not supposed to. Have you seen it after e3a815d2faa5be28551e71d5db44fb2c78133433? -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2017-06-18 00:27, Peter Eisentraut wrote: > On 6/17/17 06:48, Erik Rijkers wrote: >> On 2017-05-28 12:44, Erik Rijkers wrote: >> >> re: srsubstate in pg_subscription_rel: >> >>> No idea what it means. At the very least this value 'w' is missing >>> from the documentation, which only mentions: >>> i = initialize >>> d = data copy >>> s = synchronized >>> r = (normal replication) >> >> Shouldn't we add this to that table (51.53) in the documentation? >> >> After all, the value 'w' does show up when you monitor >> pg_subscription_rel. > > It's not supposed to. Have you seen it after > e3a815d2faa5be28551e71d5db44fb2c78133433? Ah no, I haven't seen that 'w' value after that (and 1000s of tests have run without error since then). I just hadn't realized that the 'w' value I had reported was indeed an erroneous state. Thanks, this is OK then. Erik Rijkers
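(For anyone watching these table-sync states: a minimal example of the kind of query one might run on the subscriber to monitor srsubstate. Per the documentation values quoted above it should only report i, d, s or r; the transient 'w' state is internal and, after the commit mentioned above, should no longer surface in the catalog.)

  -- per-table synchronization state on the subscriber
  SELECT s.subname,
         sr.srrelid::regclass AS relation,
         sr.srsubstate        AS state
    FROM pg_subscription_rel sr
    JOIN pg_subscription s ON s.oid = sr.srsubid
   ORDER BY s.subname, sr.srrelid;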