Thread: Regression tests failures on Windows Server 2019 - on master at commit # d816f366b

Regression tests failures on Windows Server 2019 - on master at commit # d816f366b

From
Bharath Rupireddy
Date:
Hi,

It looks like regression tests are failing on Windows Server 2019 [1].
Ignore if it's reported elsewhere.

[1] https://github.com/postgres/postgres/runs/5419324953
❌ 00:02 test_pl

[08:28:53.758]
[08:28:53.758] c:\cirrus>call "C:/Program
Files/Git/usr/bin/timeout.exe" -v -k60s 15m perl
src/tools/msvc/vcregress.pl plcheck
[08:28:53.953] ============================================================
[08:28:53.953] Checking plpgsql
[08:28:53.977] (using postmaster on Unix socket, default port)
[08:28:53.977] ============== dropping database "pl_regression"
==============
[08:28:56.044] psql: error: connection to server at "localhost" (::1),
port 5432 failed: Connection refused (0x0000274D/10061)
[08:28:56.044] Is the server running on that host and accepting TCP/IP
connections?
[08:28:56.044] connection to server at "localhost" (127.0.0.1), port
5432 failed: Connection refused (0x0000274D/10061)
[08:28:56.044] Is the server running on that host and accepting TCP/IP
connections?
[08:28:56.050] command failed: "c:/cirrus/Debug/psql/psql" -X -c "SET
client_min_messages = warning" -c "DROP DATABASE IF EXISTS
\"pl_regression\"" "postgres"
[08:28:56.063]
[08:28:56.063] c:\cirrus>if 2 NEQ 0 exit /b 2
[08:28:56.084]
[08:28:56.084] Exit status: 2

Regards,
Bharath Rupireddy.



On Fri, Mar 4, 2022 at 10:37 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
> It looks like regression tests are failing on Windows Server 2019 [1].
> Ignore if it's reported elsewhere.

It's broken since 46ab07ff "Clean up assorted failures under clang's
-fsanitize=undefined checks.":

https://github.com/postgres/postgres/commits/master

I don't see what that patch has to do with the symptoms.  It looks a
bit like the cluster started by the "startcreate" step ("pg_ctl.exe
start ...") is mysteriously no longer running when we get to the
"test_pl" step ("Connection refused").



Thomas Munro <thomas.munro@gmail.com> writes:
> On Fri, Mar 4, 2022 at 10:37 PM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
>> It looks like regression tests are failing on Windows Server 2019 [1].
>> Ignore if it's reported elsewhere.

> It's broken since 46ab07ff "Clean up assorted failures under clang's
> -fsanitize=undefined checks.":

> https://github.com/postgres/postgres/commits/master

> I don't see what that patch has to do with the symptoms.

The buildfarm is not unhappy, so I'd be looking at what changed
recently in the cfbot's setup.

I find it a little suspicious that startcreate is merrily starting
the postmaster on port 5432, without concern for the possibility
that that port is in use or blocked by a packet filter.

            regards, tom lane



Hi,

On 2022-03-04 09:54:28 -0500, Tom Lane wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > https://github.com/postgres/postgres/commits/master
> 
> > I don't see what that patch has to do with the symptoms.
> 
> The buildfarm is not unhappy, so I'd be looking at what changed
> recently in the cfbot's setup.

Very odd.


> I find it a little suspicious that startcreate is merrily starting
> the postmaster on port 5432, without concern for the possibility
> that that port is in use or blocked by a packet filter.

It's a container that's just created for that run, so there shouldn't be
anything else running. We also see that the server successfully starts:

2022-03-03 23:24:15.261 GMT [5504][postmaster] LOG:  starting PostgreSQL 15devel, compiled by Visual C++ build 1929,
64-bit
2022-03-03 23:24:15.270 GMT [5504][postmaster] LOG:  listening on IPv6 address "::1", port 5432
2022-03-03 23:24:15.270 GMT [5504][postmaster] LOG:  listening on IPv4 address "127.0.0.1", port 5432
2022-03-03 23:24:15.321 GMT [5808][startup] LOG:  database system was shut down at 2022-03-03 23:24:15 GMT
2022-03-03 23:24:15.341 GMT [5504][postmaster] LOG:  database system is ready to accept connections

Greetings,

Andres Freund



Andres Freund <andres@anarazel.de> writes:
> On 2022-03-04 09:54:28 -0500, Tom Lane wrote:
>> I find it a little suspicious that startcreate is merrily starting
>> the postmaster on port 5432, without concern for the possibility
>> that that port is in use or blocked by a packet filter.

> It's a container that's just created for that run, so there shouldn't be
> anything else running. We also see that the server successfully starts:

True, so the port was free.  The postmaster wouldn't know whether there's
a relevant packet filter though.  The reason I'm harping on this is that
it's hard to see why this postmaster start would fail before anything
was actually done, when several previous starts on other ports behaved
just fine.

Actually ... looking closer at the failure log:

[09:41:50.589] Checking plpgsql
[09:41:50.608] (using postmaster on Unix socket, default port)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[09:41:50.608] ============== dropping database "pl_regression"      ==============
[09:41:52.731] psql: error: connection to server at "localhost" (::1), port 5432 failed: Connection refused
(0x0000274D/10061)
[09:41:52.731]     Is the server running on that host and accepting TCP/IP connections?
[09:41:52.731] connection to server at "localhost" (127.0.0.1), port 5432 failed: Connection refused (0x0000274D/10061)
[09:41:52.731]     Is the server running on that host and accepting TCP/IP connections?

pg_regress thinks the postmaster is listening to a Unix socket.
Maybe it's *only* listening to a Unix socket.  In any case,
psql has very clearly not been told to use a Unix socket.
Something is not wired up correctly there.

            regards, tom lane



Hi,

On 2022-03-04 09:10:14 -0800, Andres Freund wrote:
> > I find it a little suspicious that startcreate is merrily starting
> > the postmaster on port 5432, without concern for the possibility
> > that that port is in use or blocked by a packet filter.
> 
> It's a container that's just created for that run, so there shouldn't be
> anything else running. We also see that the server successfully starts:
> 
> 2022-03-03 23:24:15.261 GMT [5504][postmaster] LOG:  starting PostgreSQL 15devel, compiled by Visual C++ build 1929,
64-bit
> 2022-03-03 23:24:15.270 GMT [5504][postmaster] LOG:  listening on IPv6 address "::1", port 5432
> 2022-03-03 23:24:15.270 GMT [5504][postmaster] LOG:  listening on IPv4 address "127.0.0.1", port 5432
> 2022-03-03 23:24:15.321 GMT [5808][startup] LOG:  database system was shut down at 2022-03-03 23:24:15 GMT
> 2022-03-03 23:24:15.341 GMT [5504][postmaster] LOG:  database system is ready to accept connections

Oh, huh. I added a printout of the running tasks, and it sure looks like
postgres is not running anymore when plcheck is run, without anything ending
up in the server log...

It looks like the CI agent had some changes about stdin / stdout of the shells
used to start scripts. The version between the last successful run and the
first failing run changed.
https://github.com/cirruslabs/cirrus-ci-agent/commits/master

I wonder if we're missing some steps, at least on windows, to make pg_ctl
start independent of the starting shell?

Greetings,

Andres Freund



Hi,

On 2022-03-04 12:23:43 -0500, Tom Lane wrote:
> Actually ... looking closer at the failure log:
> 
> [09:41:50.589] Checking plpgsql
> [09:41:50.608] (using postmaster on Unix socket, default port)
>                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> [09:41:50.608] ============== dropping database "pl_regression"      ==============
> [09:41:52.731] psql: error: connection to server at "localhost" (::1), port 5432 failed: Connection refused
(0x0000274D/10061)
> [09:41:52.731]     Is the server running on that host and accepting TCP/IP connections?
> [09:41:52.731] connection to server at "localhost" (127.0.0.1), port 5432 failed: Connection refused
(0x0000274D/10061)
> [09:41:52.731]     Is the server running on that host and accepting TCP/IP connections?
> 
> pg_regress thinks the postmaster is listening to a Unix socket.
> Maybe it's *only* listening to a Unix socket.  In any case,
> psql has very clearly not been told to use a Unix socket.
> Something is not wired up correctly there.

I saw that too, but I don't think it's the primary problem. The server is
listening on ::1 as we know from the log, and quite evidently psql is trying
to connect to that too.  This output also looked this way before the failures.

pg_regress outputs the above message when neither PGHOST, PGPORT or
--host/--port are set. On windows that nevertheless ends up with connecting to
localhost.

Greetings,

Andres Freund



Andres Freund <andres@anarazel.de> writes:
> On 2022-03-04 12:23:43 -0500, Tom Lane wrote:
>> pg_regress thinks the postmaster is listening to a Unix socket.

> pg_regress outputs the above message when neither PGHOST, PGPORT or
> --host/--port are set. On windows that nevertheless ends up with connecting to
> localhost.

Yeah, I just traced that down.  pg_regress was not updated when libpq's
behavior was changed for platform-varying DEFAULT_PGSOCKET_DIR.
I'll go fix that, but you're right that it's cosmetic.

            regards, tom lane



Hi,

On 2022-03-04 09:30:37 -0800, Andres Freund wrote:
> I wonder if we're missing some steps, at least on windows, to make pg_ctl
> start independent of the starting shell?

Sure looks that way. On windows, if I do pg_ctl start, then hit ctrl-c, the
server shuts down.

Greetings,

Andres Freund



On 2022-03-04 09:46:44 -0800, Andres Freund wrote:
> On 2022-03-04 09:30:37 -0800, Andres Freund wrote:
> > I wonder if we're missing some steps, at least on windows, to make pg_ctl
> > start independent of the starting shell?
>
> Sure looks that way. On windows, if I do pg_ctl start, then hit ctrl-c, the
> server shuts down.

Short term the easiest fix might be to start postgres for those tests as a
service. But it seems we should fix whatever the cause of that
terminal-connectedness behaviour is.

I'm out for ~2-3h. I started a test run with using a service just now:
https://cirrus-ci.com/task/5519573792325632 but I very well might have typoed
something...



Hi,

On 2022-03-04 09:57:35 -0800, Andres Freund wrote:
> On 2022-03-04 09:46:44 -0800, Andres Freund wrote:
> > On 2022-03-04 09:30:37 -0800, Andres Freund wrote:
> > > I wonder if we're missing some steps, at least on windows, to make pg_ctl
> > > start independent of the starting shell?
> >
> > Sure looks that way. On windows, if I do pg_ctl start, then hit ctrl-c, the
> > server shuts down.
> 
> Short term the easiest fix might be to start postgres for those tests as a
> service. But it seems we should fix whatever the cause of that
> terminal-connectedness behaviour is.
> 
> I'm out for ~2-3h. I started a test run with using a service just now:
> https://cirrus-ci.com/task/5519573792325632 but I very well might have typoed
> something...

Seems to have worked for the first few tests at least. Unless somebody wants
to clean up that commit and push it, I'll do so once I'm back.


Perhaps pg_ctl needs to call FreeConsole() or such?

https://docs.microsoft.com/en-us/windows/console/closing-a-console

Or perhaps pg_ctl ought to pass CREATE_NEW_PROCESS_GROUP to CreateProcess()?
The lack of a process group would explain why we're getting signalled on
ctrl-c...

Greetings,

Andres Freund



On Sat, Mar 5, 2022 at 7:10 AM Andres Freund <andres@anarazel.de> wrote:
> Or perhaps pg_ctl ought to pass CREATE_NEW_PROCESS_GROUP to CreateProcess()?
> The lack of a process group would explain why we're getting signalled on
> ctrl-c...

I thought that sounded promising, given that the recent Cirrus agent
commit you pointed to says "Always kill spawned shell's process group
to avoid pipe FD hangs", and given that we do something conceptually
similar on Unix.  It doesn't seem to help, though...

https://cirrus-ci.com/task/5572163880091648



Hi,

On 2022-03-04 09:57:35 -0800, Andres Freund wrote:
> On 2022-03-04 09:46:44 -0800, Andres Freund wrote:
> > On 2022-03-04 09:30:37 -0800, Andres Freund wrote:
> > > I wonder if we're missing some steps, at least on windows, to make pg_ctl
> > > start independent of the starting shell?
> >
> > Sure looks that way. On windows, if I do pg_ctl start, then hit ctrl-c, the
> > server shuts down.
> 
> Short term the easiest fix might be to start postgres for those tests as a
> service. But it seems we should fix whatever the cause of that
> terminal-connectedness behaviour is.
> 
> I'm out for ~2-3h. I started a test run with using a service just now:
> https://cirrus-ci.com/task/5519573792325632 but I very well might have typoed
> something...

That fixed the immediate problem, but dblink, postgres_fdw tests failed:

https://api.cirrus-ci.com/v1/artifact/task/5519573792325632/log/contrib/dblink/regression.diffs
 FROM dblink(connection_parameters(),'SELECT * FROM foo') AS t(a int, b text, c text[])
 WHERE t.a > 7;
- a | b |     c      
----+---+------------
- 8 | i | {a8,b8,c8}
- 9 | j | {a9,b9,c9}
-(2 rows)
-
+ERROR:  could not establish connection
+DETAIL:  connection to server at "localhost" (::1), port 5432 failed: FATAL:
 role "SYSTEM" does not exist


https://api.cirrus-ci.com/v1/artifact/task/5519573792325632/log/contrib/postgres_fdw/regression.diffs
+ERROR:  could not connect to server "loopback"

and it also seems to redirect logging to the event log without further
configuration...

The dblink and fdw failures can presumably be fixed by passing current_user as
the username. That seems like a good idea anyway?

Greetings,

Andres Freund



Andres Freund <andres@anarazel.de> writes:
> That fixed the immediate problem, but dblink, postgres_fdw tests failed:
> +ERROR:  could not establish connection
> +DETAIL:  connection to server at "localhost" (::1), port 5432 failed: FATAL:
>  role "SYSTEM" does not exist

[ scratches head... ]  Where is libpq getting that username from?
Why is it different from whatever we'd determined during initdb?
(Maybe a case-folding issue??)

> The dblink and fdw failures can presumably be fixed by passing current_user as
> the username. That seems like a good idea anyway?

I don't think it's a good idea to hack that without understanding
why it's suddenly going wrong.

            regards, tom lane



Hi,

On 2022-03-05 09:29:01 +1300, Thomas Munro wrote:
> On Sat, Mar 5, 2022 at 7:10 AM Andres Freund <andres@anarazel.de> wrote:
> > Or perhaps pg_ctl ought to pass CREATE_NEW_PROCESS_GROUP to CreateProcess()?
> > The lack of a process group would explain why we're getting signalled on
> > ctrl-c...
> 
> I thought that sounded promising, given that the recent Cirrus agent
> commit you pointed to says "Always kill spawned shell's process group
> to avoid pipe FD hangs", and given that we do something conceptually
> similar on Unix.  It doesn't seem to help, though...
> 
> https://cirrus-ci.com/task/5572163880091648

I suspect one also needs the console detach thing.

I don't really understand why start_postmaster() bothers to wrap postmaster
start through cmd.exe, particularly when it prevents us from getting
postmaster's pid. Also the caveats around cmd.exe and sharing mode.

Greetings,

Andres Freund



Hi,

On 2022-03-04 16:51:32 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > That fixed the immediate problem, but dblink, postgres_fdw tests failed:
> > +ERROR:  could not establish connection
> > +DETAIL:  connection to server at "localhost" (::1), port 5432 failed: FATAL:
> >  role "SYSTEM" does not exist
>
> [ scratches head... ]  Where is libpq getting that username from?
> Why is it different from whatever we'd determined during initdb?
> (Maybe a case-folding issue??)

When running as a service (via pg_ctl register) the default username to run
under appears to be SYSTEM. Which then differs from the user that vcregress.pl
runs under. Trying to make it use the current user now - I don't know what
permissions services are needed to run a service as a user or such.


> > The dblink and fdw failures can presumably be fixed by passing current_user as
> > the username. That seems like a good idea anyway?
>
> I don't think it's a good idea to hack that without understanding
> why it's suddenly going wrong.

I think I understand why - seems more a question of whether we want to support
running tests against a server running as a different user.

Greetings,

Andres Freund



Andres Freund <andres@anarazel.de> writes:
> I don't really understand why start_postmaster() bothers to wrap postmaster
> start through cmd.exe, particularly when it prevents us from getting
> postmaster's pid. Also the caveats around cmd.exe and sharing mode.

I think the basic idea is to avoid having to write our own code to do
the I/O redirections --- that's certainly why we use a shell on the
Unix side.  But it might well be that biting the bullet and handling
redirection ourselves would be easier than coping with all these
other problems.

            regards, tom lane



On 2022-03-04 14:04:44 -0800, Andres Freund wrote:
> Trying to make it use the current user now - I don't know what
> permissions services are needed to run a service as a user or such.

My first attempt of using %USERNAME% failed:

[22:10:20.862] c:\cirrus>call tmp_install\bin\pg_ctl.exe register -D tmp_check/db "-UContainerAdministrator"
[22:10:20.889] pg_ctl: could not register service "PostgreSQL": error code 1057

seems to require a domain for some reason:
[22:33:54.599] c:\cirrus>call tmp_install\bin\pg_ctl.exe register -Dtmp_check/db -U "User
Manager\ContainerAdministrator"

but that then doesn't start:
[22:33:54.660] c:\cirrus>call net start PostgreSQL
[22:33:55.887] System error 1068 has occurred.

So it indeed seems hard to start a service as the current user. At least with
my limited windows knowledge.



On Sat, Mar 5, 2022 at 10:56 AM Andres Freund <andres@anarazel.de> wrote:
> I suspect one also needs the console detach thing.

I tried adding DETACHED_PROCESS (which should be like calling
FreeConsole() in child process) and then I tried CREATE_NEW_CONSOLE
instead, on top of CREATE_NEW_PROCESS_GROUP.  Neither helped, though I
lost the postmaster's output.

Things I tried and their output are here:
https://github.com/macdice/postgres/commits/hack3



Hi,

On 2022-03-05 13:21:26 +1300, Thomas Munro wrote:
> On Sat, Mar 5, 2022 at 10:56 AM Andres Freund <andres@anarazel.de> wrote:
> > I suspect one also needs the console detach thing.
>
> I tried adding DETACHED_PROCESS (which should be like calling
> FreeConsole() in child process) and then I tried CREATE_NEW_CONSOLE
> instead, on top of CREATE_NEW_PROCESS_GROUP.  Neither helped, though I
> lost the postmaster's output.

I think the issue with the process group is real, but independent of the
failing test.Locally just specifying CREATE_NEW_PROCESS_GROUP fixes the
problem of a pg_ctl start'ed database being stopped after ctrl-c.

I think CI failing is due to cirrus bug, assuming a bit too much similarity
between unix and windows....

https://github.com/cirruslabs/cirrus-ci-agent/issues/218#issuecomment-1059657781

As indicated in the message above it, there's a workaround. Not sure if worth
committing, if they were to fix it in a few days?  cfbot could be repaired by
just adding a repo environment variable of CIRRUS_AGENT_VERSION 1.73.2...
OTOH, committing and reverting a single line + comment is cheap.

Greetings,

Andres Freund



On Sat, Mar 5, 2022 at 4:19 PM Andres Freund <andres@anarazel.de> wrote:
> https://github.com/cirruslabs/cirrus-ci-agent/issues/218#issuecomment-1059657781

Oof.  Nice detective work.

> As indicated in the message above it, there's a workaround. Not sure if worth
> committing, if they were to fix it in a few days?  cfbot could be repaired by
> just adding a repo environment variable of CIRRUS_AGENT_VERSION 1.73.2...
> OTOH, committing and reverting a single line + comment is cheap.

I vote for committing that workaround into the tree temporarily,
because it's not just cfbot, it's also everyone's dev branches on
Github + the official mirror that are red.



Hi,

On 2022-03-05 16:39:21 +1300, Thomas Munro wrote:
> On Sat, Mar 5, 2022 at 4:19 PM Andres Freund <andres@anarazel.de> wrote:
> > https://github.com/cirruslabs/cirrus-ci-agent/issues/218#issuecomment-1059657781
> 
> Oof.  Nice detective work.

Thanks.

> > As indicated in the message above it, there's a workaround. Not sure if worth
> > committing, if they were to fix it in a few days?  cfbot could be repaired by
> > just adding a repo environment variable of CIRRUS_AGENT_VERSION 1.73.2...
> > OTOH, committing and reverting a single line + comment is cheap.
> 
> I vote for committing that workaround into the tree temporarily,
> because it's not just cfbot, it's also everyone's dev branches on
> Github + the official mirror that are red.

I'll do so after making dinner, unless you want to do so sooner. It did fix
the problem (intermixed with a few irrelevant changes): https://cirrus-ci.com/task/4928987829895168

- Andres



On 2022-03-04 20:06:43 -0800, Andres Freund wrote:
> On 2022-03-05 16:39:21 +1300, Thomas Munro wrote:
> > I vote for committing that workaround into the tree temporarily,
> > because it's not just cfbot, it's also everyone's dev branches on
> > Github + the official mirror that are red.
> 
> I'll do so after making dinner, unless you want to do so sooner. It did fix
> the problem (intermixed with a few irrelevant changes): https://cirrus-ci.com/task/4928987829895168

Pushed.



On 3/4/22 17:04, Andres Freund wrote:
> Hi,
>
> On 2022-03-04 16:51:32 -0500, Tom Lane wrote:
>> Andres Freund <andres@anarazel.de> writes:
>>> That fixed the immediate problem, but dblink, postgres_fdw tests failed:
>>> +ERROR:  could not establish connection
>>> +DETAIL:  connection to server at "localhost" (::1), port 5432 failed: FATAL:
>>>  role "SYSTEM" does not exist
>> [ scratches head... ]  Where is libpq getting that username from?
>> Why is it different from whatever we'd determined during initdb?
>> (Maybe a case-folding issue??)
> When running as a service (via pg_ctl register) the default username to run
> under appears to be SYSTEM. Which then differs from the user that vcregress.pl
> runs under. Trying to make it use the current user now - I don't know what
> permissions services are needed to run a service as a user or such.


SeServiceLogonRight is what the user needs to run the service.


To manage it (e.g. stop or start it) they need some extra permissions,
see for example <http://get-carbon.org/Grant-ServicePermission.html>


cheers


andrew



--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Hi,

On 2022-03-04 22:01:22 -0800, Andres Freund wrote:
> On 2022-03-04 20:06:43 -0800, Andres Freund wrote:
> > On 2022-03-05 16:39:21 +1300, Thomas Munro wrote:
> > > I vote for committing that workaround into the tree temporarily,
> > > because it's not just cfbot, it's also everyone's dev branches on
> > > Github + the official mirror that are red.
> >
> > I'll do so after making dinner, unless you want to do so sooner. It did fix
> > the problem (intermixed with a few irrelevant changes): https://cirrus-ci.com/task/4928987829895168
>
> Pushed.

Cirrus now provides a way to get the old behaviour back without pinning an old
agent version. See attached and a run passing the problematic steps [1]:

The way this is intended to be done in cirrus (rather than preventing it from
killing "escaped" processes) would be to use 'background_script' to run
something longer running.

Unfortunately that's surprisingly hard with our tooling, or maybe I'm just
daft:

1) We don't have a way to wait for server startup to finish if we don't block
   on pg_ctl start. So we'd have to use 'sleep', write a loop around
   pq_isready. Perhaps pg_ctl should have an option to wait for server startup
   / shutdown without doing the starting/stopping itself?

2) There's no trivial way of starting postgres with pg_ctl and then waiting
   for the server to be shut down in the background script. The easiest would
   be to start psql just wait for it to be killed by an immediate shutdown :/.

3) We can't just start postgres in the foreground to get around 2), because
   pg_ctl does the dropping of permissions we need, rather than postgres
   itself. It'd also need 1).

Seems like we're missing some fairly basic tooling somehow...

Regards,

Andres

[1] https://cirrus-ci.com/task/5810196135018496

Attachment
On 2022-03-19 11:16:48 -0700, Andres Freund wrote:
> Cirrus now provides a way to get the old behaviour back without pinning an old
> agent version. See attached

I've pushed that now. If we can come up with a better recipe for starting
postgres in the background, cool. But until then...