Re: Stream replication test fails of cfbot/windows server 2019 - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: Stream replication test fails of cfbot/windows server 2019 |
Date | |
Msg-id | 20220113003442.rfejvb4pt7sllyla@alap3.anarazel.de Whole thread Raw |
In response to | Re: Stream replication test fails of cfbot/windows server 2019 (Thomas Munro <thomas.munro@gmail.com>) |
List | pgsql-hackers |
Hi, On 2022-01-13 12:40:00 +1300, Thomas Munro wrote: > On Thu, Jan 13, 2022 at 12:24 AM Julien Rouhaud <rjuju123@gmail.com> wrote: > > On Wed, Jan 12, 2022 at 01:51:24PM +0300, Michail Nikolaev wrote: > > > https://cirrus-ci.com/task/6532060239101952 > > > https://cirrus-ci.com/task/4755551606276096 > > For the record, cfbot only started running the recovery tests on > Windows a couple of weeks ago (when the new improved .cirrus.yml > landed in the tree). I don't know if it's significant that Pavel's > patch is failing every time: > > https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/36/3464 I think we need to make wait_for_catchup() log something more useful. It's pretty hard to debug failures involving it right now. E.g. the logs for https://cirrus-ci.com/task/5353503093686272 tells us that Waiting for replication conn standby_1's replay_lsn to pass '0/3023208' on primary # poll_query_until timed out executing this query: # SELECT '0/3023208' <= replay_lsn AND state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name ='standby_1'; # expecting this output: # t # last actual query output: # # with stderr: timed out waiting for catchup at t/001_stream_rep.pl line 50. and there's lots of instances of that query in the primary's logs. But nowhere do we see what the actual replication progress is. So we don't know if the problem is just that there's a record that doesn't need to be flushed to disk, or something more fundamental. I think instead of croak("timed out waiting for catchup") we should make wait_for_catchup() query the primary all columns of pg_stat_replication and report those. And perhaps also report the result of SELECT * FROM pg_control_recovery(), pg_control_checkpoint(); on the standby? > I don't know how many different kinds of flakiness we're suffering > from on Windows. Could these cases be explained by the FD_CLOSE > problem + timing differences? Maybe. There's certainly something odd going on: https://api.cirrus-ci.com/v1/artifact/task/5353503093686272/log/src/test/recovery/tmp_check/log/001_stream_rep_primary.log https://api.cirrus-ci.com/v1/artifact/task/5353503093686272/log/src/test/recovery/tmp_check/log/001_stream_rep_standby_1.log standby_1: 2022-01-12 08:21:36.543 GMT [8584][walreceiver] LOG: started streaming WAL from primary at 0/3000000 on timeline 1 primary: 2022-01-12 08:21:38.855 GMT [6276][postmaster] LOG: received fast shutdown request 2022-01-12 08:21:39.146 GMT [6276][postmaster] LOG: database system is shut down 2022-01-12 08:21:50.235 GMT [5524][postmaster] LOG: starting PostgreSQL 15devel, compiled by Visual C++ build 1929, 64-bit 2022-01-12 08:21:50.417 GMT [5524][postmaster] LOG: database system is ready to accept connections standby_1: 2022-01-12 08:21:53.469 GMT [5108][walsender] [standby_2][2/0:0] LOG: received replication command: START_REPLICATION0/3000000 TIMELINE 1 2022-01-12 08:28:33.949 GMT [6484][postmaster] LOG: database system is shut down afaict standby_1's walreceiver never realized that the primary stopped? More evidence to that fact is that the above "last actual query output:" shows nothing rather than 'f' for SELECT '0/3023208' <= replay_lsn AND state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'standby_1'; I wonder if it's relevant that the .cirrus file uses "unix sockets" on windows as well, to avoid port conflicts (otherwise I saw frequent spurious test failures due to port conflicts). # Avoids port conflicts between concurrent tap test runs PG_TEST_USE_UNIX_SOCKETS: 1 It's not particularly hard to imagine that either our "windows unix socket" support still has some bugs, or that windows' implementation of unix sockets is borked. Greetings, Andres Freund
pgsql-hackers by date: