Re: [HACKERS] snapbuild woes - Mailing list pgsql-hackers

From Erik Rijkers
Subject Re: [HACKERS] snapbuild woes
Date
Msg-id 076a3e4c3d4fef091522d3cd7e50c7eb@xs4all.nl
In response to Re: [HACKERS] snapbuild woes  (Andres Freund <andres@anarazel.de>)
Responses Re: [HACKERS] snapbuild woes  (Petr Jelinek <petr.jelinek@2ndquadrant.com>)
List pgsql-hackers
On 2017-05-05 02:00, Andres Freund wrote:
> 
> Could you have a look?

Running tests with these three patches:

> 0001-WIP-Fix-off-by-one-around-GetLastImportantRecPtr.patch+
> 0002-WIP-Possibly-more-robust-snapbuild-approach.patch     +
> fix-statistics-reporting-in-logical-replication-work.patch    (on top of 44c528810)

I test by 15-minute pgbench runs while there is a logical replication 
connection. Primary and replica are on the same machine.

I have seen errors on 3 different machines (where an error means: at least 
1 of the 4 pgbench tables is not md5-equal). Better, faster machines seem 
to yield fewer errors.
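The md5-equality check amounts to something like the sketch below (illustrative only: the COPY-based dump, the file names, and the `tables_equal` helper are assumptions, not the actual test script):

```shell
# tables_equal: report whether two table dumps have the same md5 checksum.
# (Illustrative helper; not the actual test script.)
tables_equal () {
  [ "$(md5sum < "$1")" = "$(md5sum < "$2")" ]
}

# The dumps would come from primary (port 6972) and replica (port 6973),
# e.g. (row order matters, so a real check would sort or use ORDER BY):
#   psql -p 6972 -Atc "COPY pgbench_accounts TO STDOUT" > accounts.primary
#   psql -p 6973 -Atc "COPY pgbench_accounts TO STDOUT" > accounts.replica
#   tables_equal accounts.primary accounts.replica || echo "not md5-equal"
```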

Normally I see in pg_stat_replication (on master) one process in state 
'streaming'.
 pid  |     wal     | replay_loc  |   diff   |   state   |   app   | sync_state
16495 | 11/EDBC0000 | 11/EA3FEEE8 | 58462488 | streaming | derail2 | async

Often there are two more processes in pg_stat_replication that remain 
in state 'startup'.

In failing sessions the 'streaming'-state process is missing; only those 
two processes are present, and they remain in 'startup'.
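The pg_stat_replication output above comes from a query along these lines (my reconstruction from the column headers shown; not necessarily the exact query used):

```sql
-- Reconstructed from the output columns; on pre-10 catalogs the columns
-- are sent_location / replay_location, on 10+ they are sent_lsn / replay_lsn.
SELECT pid,
       sent_location                   AS wal,
       replay_location                 AS replay_loc,
       sent_location - replay_location AS diff,      -- bytes of replay lag
       sent_location = replay_location,              -- the "?column?" boolean
       state,
       application_name                AS app,
       sync_state
FROM pg_stat_replication;
```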

FWIW, below is the output of a successful and a failed run:


------------------ successful run:
creating tables...
1590400 of 2500000 tuples (63%) done (elapsed 5.34 s, remaining 3.05 s)
2500000 of 2500000 tuples (100%) done (elapsed 9.63 s, remaining 0.00 s)
vacuum...
set primary keys...
done.
create publication pub1 for all tables;
create subscription sub1 connection 'port=6972 application_name=derail2' 
publication pub1 with (disabled);
alter subscription sub1 enable;
-- pgbench -c 90 -j 8 -T 900 -P 180 -n   --  scale 25
progress: 180.0 s, 82.5 tps, lat 1086.845 ms stddev 3211.785
progress: 360.0 s, 25.4 tps, lat 3469.040 ms stddev 6297.440
progress: 540.0 s, 28.9 tps, lat 3131.438 ms stddev 4288.130
progress: 720.0 s, 27.5 tps, lat 3285.024 ms stddev 4113.841
progress: 900.0 s, 47.2 tps, lat 1896.698 ms stddev 2182.695
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 25
query mode: simple
number of clients: 90
number of threads: 8
duration: 900 s
number of transactions actually processed: 38175
latency average = 2128.606 ms
latency stddev = 3948.634 ms
tps = 42.151205 (including connections establishing)
tps = 42.151589 (excluding connections establishing)
-- waiting 0s... (always)
 port | pg_stat_replication | pid  |     wal     | replay_loc  | diff | ?column? |   state   |   app   | sync_state
 6972 | pg_stat_replication | 2545 | 18/432B2180 | 18/432B2180 |    0 | t        | streaming | derail2 | async

2017.05.08 23:19:22
-- getting md5 (cb)
6972 a,b,t,h:  2500000     25    250  38175  b2ba48b53 b3788a837 d1afac950 d4abcc72e master
6973 a,b,t,h:  2500000     25    250  38175  b2ba48b53 b3788a837 d1afac950 d4abcc72e replica ok  bee2312c7
2017.05.08 23:20:48
 port | pg_stat_replication | pid  |     wal     | replay_loc  |   diff   | ?column? |   state   |   app   | sync_state
 6972 | pg_stat_replication | 2545 | 18/4AEEC8C0 | 18/453FBD20 | 95357856 | f        | streaming | derail2 | async
------------------------------------


------------------ failure:
creating tables...
1777100 of 2500000 tuples (71%) done (elapsed 5.06 s, remaining 2.06 s)
2500000 of 2500000 tuples (100%) done (elapsed 7.41 s, remaining 0.00 s)
vacuum...
set primary keys...
done.
create publication pub1 for all tables;
create subscription sub1 connection 'port=6972 application_name=derail2' 
publication pub1 with (disabled);
alter subscription sub1 enable;
 port | pg_stat_replication |  pid  |     wal     | replay_loc | diff | ?column? |  state  |   app   | sync_state
 6972 | pg_stat_replication | 11945 | 18/5E2913D0 |            |      |          | catchup | derail2 | async

-- pgbench -c 90 -j 8 -T 900 -P 180 -n   --  scale 25
progress: 180.0 s, 78.4 tps, lat 1138.348 ms stddev 2884.815
progress: 360.0 s, 69.2 tps, lat 1309.716 ms stddev 2594.231
progress: 540.0 s, 59.0 tps, lat 1519.146 ms stddev 2033.400
progress: 720.0 s, 62.9 tps, lat 1421.854 ms stddev 1775.066
progress: 900.0 s, 57.0 tps, lat 1575.693 ms stddev 1681.800
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 25
query mode: simple
number of clients: 90
number of threads: 8
duration: 900 s
number of transactions actually processed: 58846
latency average = 1378.259 ms
latency stddev = 2304.159 ms
tps = 65.224168 (including connections establishing)
tps = 65.224788 (excluding connections establishing)
-- waiting 0s... (always)
 port | pg_stat_replication |  pid  |     wal     | replay_loc | diff | ?column? |  state  |   app   | sync_state
 6972 | pg_stat_replication | 11948 | 18/7469A038 |            |      |          | startup | derail2 | async
 6972 | pg_stat_replication | 12372 | 18/7469A038 |            |      |          | startup | derail2 | async

------------------------------------

During my tests, I keep an eye on pg_stat_replication (refreshing every 
2s), and when I see those two 'startup' lines in pg_stat_replication 
without any 'streaming' line, I know the test is going to fail.  I 
believe this has been true for all failure cases I've seen (except the 
much rarer stuck-DROP-SUBSCRIPTION case mentioned in another thread).
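That 2-second check can be sketched roughly as below (the `looks_stuck` helper and the psql invocation are illustrative assumptions, not the actual monitoring script):

```shell
# looks_stuck: given newline-separated walsender states, return success
# when no walsender is 'streaming' and at least two are stuck in 'startup'.
# (Illustrative helper; not the actual monitoring script.)
looks_stuck () {
  streaming=$(printf '%s\n' "$1" | grep -c '^streaming$')
  startup=$(printf '%s\n' "$1" | grep -c '^startup$')
  [ "$streaming" -eq 0 ] && [ "$startup" -ge 2 ]
}

# Polling loop against the primary (port 6972), illustrative:
#   while sleep 2; do
#     looks_stuck "$(psql -p 6972 -Atc 'SELECT state FROM pg_stat_replication')" \
#       && echo "likely failure: walsenders stuck in 'startup'"
#   done
```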

Sorry, I have not been able to get anything clearer or more definitive...


thanks,


Erik Rijkers
