Re: [HACKERS] snapbuild woes - Mailing list pgsql-hackers

From Erik Rijkers
Subject Re: [HACKERS] snapbuild woes
Date
Msg-id 84a48a73304ed5363433ec4b9fbb4438@xs4all.nl
In response to Re: [HACKERS] snapbuild woes  (Petr Jelinek <petr.jelinek@2ndquadrant.com>)
Responses Re: [HACKERS] snapbuild woes  (Petr Jelinek <petr.jelinek@2ndquadrant.com>)
List pgsql-hackers
On 2017-05-09 11:50, Petr Jelinek wrote:

> I rebased the above mentioned patch to apply to the patches Andres
> sent, if you could try to add it on top of what you have and check
> if it still fails, that would be helpful.

It still fails.

With these patches

- 0001-WIP-Fix-off-by-one-around-GetLastImportantRecPtr.patch
- 00002-WIP-Possibly-more-robust-snapbuild-approach.patch
- fix-statistics-reporting-in-logical-replication-work.patch
- Skip-unnecessary-snapshot-builds.patch

built again on top of 44c528810a1 (so I had to add the
'fix-statistics-rep*' patch, because without it I immediately got that
assertion failure again).

As always, most runs succeed (especially on this large 192 GB, 16-core
server).

But attached is an output file from a number of runs of my
pgbench_derail2.sh test.

Overall result:

-- out_20170509_1635.txt
       3 -- pgbench -c 64 -j 8 -T 900 -P 180 -n   --  scale 25
       2 -- All is well.
       1 -- Not good, but breaking out of wait (21 times no change)

I broke it off after iteration 4, so iteration 5 never ran, and
iteration 1 failed due to a mistake in the harness (something stupid I
did) - not interesting.

iteration 2 succeeds. (eventually has 'replica ok')

iteration 3 succeeds. (eventually has 'replica ok')

iteration 4 fails.
   Just after 'alter subscription sub1 enable' I caught (as is usual)
pg_stat_replication.state as 'catchup'. So far so good.
   After the 15-minute pgbench run, pg_stat_replication has only 2
'startup' lines (and no 'catchup' or 'streaming'):

  port | pg_stat_replication |  pid   |     wal     | replay_loc | diff | ?column? |  state  |   app   | sync_state
  6972 | pg_stat_replication | 108349 | 19/8FBCC248 |            |      |          | startup | derail2 | async
  6972 | pg_stat_replication | 108351 | 19/8FBCC248 |            |      |          | startup | derail2 | async

(that's from:
    select $port1 as port, 'pg_stat_replication' as pg_stat_replication, pid
         , pg_current_wal_location() wal
         , replay_location replay_loc
         , pg_current_wal_location() - replay_location as diff
         , pg_current_wal_location() <= replay_location
         , state, application_name as app, sync_state
      from pg_stat_replication
)

It remains in this state for as long as my test program lets it (i.e.,
20 x 30s, or something like that, and then the loop is exited); in the
output file it says: 'Not good, but breaking out of wait'.
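For what it's worth, the wait loop in the harness works roughly like this (a simplified sketch, not the real pgbench_derail2.sh; check_streaming, max_tries, and the embedded query are illustrative stand-ins):

```shell
#!/bin/sh
# Sketch of the harness wait loop (hypothetical names; the real
# pgbench_derail2.sh differs). check_streaming stands in for a psql call like:
#   psql -qtAX -p "$port1" -c \
#     "select count(*) from pg_stat_replication where state <> 'streaming'"
check_streaming() {
  # Stub: pretend both walsenders stay stuck in 'startup' forever.
  echo 2
}

max_tries=20
tries=0
while [ "$tries" -lt "$max_tries" ]; do
  if [ "$(check_streaming)" = "0" ]; then
    echo "replica ok"
    break
  fi
  tries=$((tries + 1))
  # sleep 30   # the real harness waits ~30s between checks
done
if [ "$tries" -eq "$max_tries" ]; then
  echo "Not good, but breaking out of wait ($tries times no change)"
fi
```

With the stub above the loop never sees state 'streaming', so it exits after max_tries and prints the failure marker, which is what the output file shows for iteration 4.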

Below is the accompanying ps output (with the 2 'deranged senders', as
Jeff Janes would surely call them):


UID         PID   PPID  C STIME TTY      STAT   TIME CMD
rijkers  107147      1  0 17:11 pts/35   S+     0:00 /var/data1/pg_stuff/pg_installations/pgsql.logical_replication2/bin/postgres -D /var/data1/pg_stuff/pg_installations
rijkers  107149 107147  0 17:11 ?        Ss     0:00  \_ postgres: logger process
rijkers  107299 107147  0 17:11 ?        Ss     0:01  \_ postgres: checkpointer process
rijkers  107300 107147  0 17:11 ?        Ss     0:00  \_ postgres: writer process
rijkers  107301 107147  0 17:11 ?        Ss     0:00  \_ postgres: wal writer process
rijkers  107302 107147  0 17:11 ?        Ss     0:00  \_ postgres: autovacuum launcher process
rijkers  107303 107147  0 17:11 ?        Ss     0:00  \_ postgres: stats collector process
rijkers  107304 107147  0 17:11 ?        Ss     0:00  \_ postgres: bgworker: logical replication launcher
rijkers  108348 107147  0 17:12 ?        Ss     0:01  \_ postgres: bgworker: logical replication worker for subscription 70310 sync 70293
rijkers  108350 107147  0 17:12 ?        Ss     0:00  \_ postgres: bgworker: logical replication worker for subscription 70310 sync 70298
rijkers  107145      1  0 17:11 pts/35   S+     0:02 /var/data1/pg_stuff/pg_installations/pgsql.logical_replication/bin/postgres -D /var/data1/pg_stuff/pg_installations
rijkers  107151 107145  0 17:11 ?        Ss     0:00  \_ postgres: logger process
rijkers  107160 107145  0 17:11 ?        Ss     0:08  \_ postgres: checkpointer process
rijkers  107161 107145  0 17:11 ?        Ss     0:07  \_ postgres: writer process
rijkers  107162 107145  0 17:11 ?        Ss     0:02  \_ postgres: wal writer process
rijkers  107163 107145  0 17:11 ?        Ss     0:00  \_ postgres: autovacuum launcher process
rijkers  107164 107145  0 17:11 ?        Ss     0:02  \_ postgres: stats collector process
rijkers  107165 107145  0 17:11 ?        Ss     0:00  \_ postgres: bgworker: logical replication launcher
rijkers  108349 107145  0 17:12 ?        Ss     0:27  \_ postgres: wal sender process rijkers [local] idle
rijkers  108351 107145  0 17:12 ?        Ss     0:26  \_ postgres: wal sender process rijkers [local] idle

I have had no time to add (or view) any CPU info.


Erik Rijkers




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
