Re: pg_ctl/pg_rewind tests vs. slow AIX buildfarm members - Mailing list pgsql-hackers

From Tom Lane
Subject Re: pg_ctl/pg_rewind tests vs. slow AIX buildfarm members
Msg-id 12781.1444689666@sss.pgh.pa.us
In response to Re: pg_ctl/pg_rewind tests vs. slow AIX buildfarm members  (Michael Paquier <michael.paquier@gmail.com>)
Responses Re: pg_ctl/pg_rewind tests vs. slow AIX buildfarm members  (Michael Paquier <michael.paquier@gmail.com>)
List pgsql-hackers
Michael Paquier <michael.paquier@gmail.com> writes:
> On Wed, Oct 7, 2015 at 11:52 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I think there is still room to salvage something without fully rewriting
>> the postmaster invocation logic to avoid using CMD, because it's still
>> true that checking whether the CMD process is still there should be as
>> good as checking the postmaster proper.  We just can't use kill() for it.

> I had a look at that, and attached is an updated patch showing the
> concept of using an HANDLE that pg_ctl waits for. I agree that saving
> an HANDLE the way this patch does using a static variable is a bit
> ugly especially knowing that service registration uses
> test_postmaster_connection as well with directly a postmaster. The
> good thing is that tapcheck runs smoothly, with one single failure
> though: the second call to pg_ctl start -w may succeed instead of
> failing if kicked within an interval of 3 seconds after the first one,
> based on the tests on my Windows VM. My guess is that this is caused
> by the fact that we monitor the shell process and not the postmaster
> itself. Thoughts?

Yeah, it can still succeed because pg_ctl can't tell that the
postmaster.pid created by the earlier invocation isn't the one it wants.
It adopts the values out of that file, tests the connection, finds it
works, and declares victory, not realizing that the postmaster *it*
started will soon fail (or maybe already has).

Waiting more than 2 seconds before the second "pg_ctl start -w" is enough
to make sure that test_postmaster_connection sees the pre-existing pidfile
as stale and doesn't believe that it represents a successful postmaster
start.
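
A minimal sketch of that staleness test, in C.  The function name, the
macro, and the exact comparison are assumptions made for illustration, not
a copy of pg_ctl's code; only the 2-second margin comes from the text above.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical margin, taken from the "2 seconds" mentioned above. */
#define PIDFILE_SLOP_SECS 2

/*
 * Return true if a pidfile whose recorded postmaster start time is
 * pidfile_start should be treated as left over from an earlier postmaster,
 * given that we launched our own postmaster at launch_time.
 */
static bool
pidfile_looks_stale(time_t pidfile_start, time_t launch_time)
{
    return pidfile_start < launch_time - PIDFILE_SLOP_SECS;
}
```

With this comparison, a pidfile written 3 seconds before our launch is
rejected, while one written only 1 or 2 seconds before is accepted, which is
the window in which the second "pg_ctl start -w" can wrongly succeed.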

So there's still something to be desired on Windows; but it's still
better than before in that we can reliably detect child process exit
instead of having to use the five-second heuristic.  And of course on
Unix this is way better than before.
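
For illustration, here is roughly what a direct child-exit check looks like
on Unix, using a non-blocking waitpid().  This is a sketch under the
assumption that the caller forked the child itself; the helper name is
hypothetical and this is not the committed code.  On Windows the comparable
check would be a zero-timeout wait on the saved process handle.

```c
#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>

/*
 * Poll a child we forked ourselves without blocking.
 *
 * Returns true once the child has terminated, storing its wait status in
 * *exit_status.  Returns false if it is still running (waitpid returns 0)
 * or if waitpid fails (returns -1, e.g. the pid is not our child).
 */
static bool
child_has_exited(pid_t child_pid, int *exit_status)
{
    pid_t       r = waitpid(child_pid, exit_status, WNOHANG);

    return r == child_pid;
}
```

A wait loop can call this once per iteration and give up immediately when it
returns true, rather than relying on a fixed delay to guess that the child
has died.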

So I've pushed this with some cosmetic adjustments, as well as the
not-so-cosmetic adjustment of making the service-start path also use handle
testing.  If there are remaining problems, the buildfarm should tell us.

I'm not sure if this will completely fix our problems with "pg_ctl start"
related buildfarm failures on very slow critters.  It does get rid of the
hard-wired 5-second timeout, but the 60-second timeout could still be an
issue.  I think Noah was considering a patch to allow that number to be
raised.  I'd be in favor of letting pg_ctl accept a default timeout length
from an environment variable, and then the slower critters could be fixed
by adjusting their buildfarm configurations.
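
A sketch of what reading such a default could look like.  The helper name,
the fallback rules, and any variable name passed to it are assumptions; the
only number taken from the text above is the 60-second default.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Default wait time in seconds, per the 60-second timeout mentioned above. */
#define DEFAULT_WAIT_SECS 60

/*
 * Return the wait timeout in seconds, taken from the environment variable
 * named by envvar if it holds a positive integer, else the built-in default.
 * Unset, empty, non-numeric, non-positive or out-of-range values all fall
 * back to the default (a design choice of this sketch).
 */
static int
default_wait_seconds(const char *envvar)
{
    const char *val = getenv(envvar);
    char       *end;
    long        secs;

    if (val == NULL || *val == '\0')
        return DEFAULT_WAIT_SECS;

    errno = 0;
    secs = strtol(val, &end, 10);
    if (errno != 0 || *end != '\0' || secs <= 0 || secs > INT_MAX)
        return DEFAULT_WAIT_SECS;

    return (int) secs;
}
```

A slow buildfarm member would then export the variable (say, with a value of
300) in its configuration, and an explicit command-line timeout could still
override whatever this returns.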
        regards, tom lane


