Thread: LDAP check flapping on crake due to race

LDAP check flapping on crake due to race

From
Thomas Munro
Date:
Hi,

There are one or two failures per month on crake.  It looks like when
authentication is rejected, as expected in the tests, the psql process
is exiting, but there is a race where the Perl script still wants to
write a dummy query to its stdin (?), so you get:

psql: FATAL:  LDAP authentication failed for user "test1"
ack Broken pipe: write( 13, 'SELECT 1' ) at
/usr/share/perl5/vendor_perl/IPC/Run/IO.pm line 549.

Example:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2019-11-10%2023%3A36%3A04

tmunro=> select animal, snapshot, branch from run where fail_stage =
'ldapCheck' order by snapshot desc;
 animal |      snapshot       |    branch
--------+---------------------+---------------
 crake  | 2020-08-02 02:32:30 | REL_13_STABLE
 crake  | 2020-07-22 23:36:04 | REL_12_STABLE
 crake  | 2020-07-14 00:52:04 | REL_13_STABLE
 crake  | 2020-05-15 17:35:05 | REL_11_STABLE
 crake  | 2020-04-07 20:51:03 | REL_12_STABLE
 mantid | 2020-03-04 18:17:58 | REL_12_STABLE
 mantid | 2020-03-04 17:59:58 | REL_11_STABLE
 crake  | 2020-01-17 14:33:21 | REL_12_STABLE
 crake  | 2019-11-10 23:36:04 | REL_11_STABLE
 crake  | 2019-09-09 08:48:25 | HEAD
 crake  | 2019-08-05 21:18:23 | REL_12_STABLE
 crake  | 2019-07-19 01:33:31 | HEAD
 crake  | 2019-07-16 01:06:02 | REL_11_STABLE
(13 rows)

(Ignore mantid, it had a temporary setup problem that was resolved.)



Re: LDAP check flapping on crake due to race

From
Noah Misch
Date:
On Sun, Aug 02, 2020 at 05:29:57PM +1200, Thomas Munro wrote:
> There are one or two failures per month on crake.  It looks like when
> authentication is rejected, as expected in the tests, the psql process
> is exiting, but there is a race where the Perl script still wants to
> write a dummy query to its stdin (?), so you get:
> 
> psql: FATAL:  LDAP authentication failed for user "test1"
> ack Broken pipe: write( 13, 'SELECT 1' ) at
> /usr/share/perl5/vendor_perl/IPC/Run/IO.pm line 549.

Do you suppose a fix like e12a472 would cover this?  ("psql <&-" fails with
status 1 after successful authentication, and authentication failure gives
status 2.)



Re: LDAP check flapping on crake due to race

From
Tom Lane
Date:
Noah Misch <noah@leadboat.com> writes:
> On Sun, Aug 02, 2020 at 05:29:57PM +1200, Thomas Munro wrote:
>> There are one or two failures per month on crake.  It looks like when
>> authentication is rejected, as expected in the tests, the psql process
>> is exiting, but there is a race where the Perl script still wants to
>> write a dummy query to its stdin (?), so you get:
>> psql: FATAL:  LDAP authentication failed for user "test1"
>> ack Broken pipe: write( 13, 'SELECT 1' ) at
>> /usr/share/perl5/vendor_perl/IPC/Run/IO.pm line 549.

> Do you suppose a fix like e12a472 would cover this?  ("psql <&-" fails with
> status 1 after successful authentication, and authentication failure gives
> status 2.)

AFAICT the failure is happening down inside PostgresNode::psql's call
of IPC::Run::run, so we don't really have the option to adjust things
in exactly that way.  (And messing with sub psql for the benefit of
this one caller seems pretty risky anyway.)

I'm inclined to suggest that the LDAP test's test_access could use
an empty stdin and pass "-c 'SELECT 1'" as a command line option
instead.  (Maybe that's exactly what you meant, but I'm not sure.)

I've not been able to duplicate this locally, so I have no idea if
that'd really fix it.

            regards, tom lane



Re: LDAP check flapping on crake due to race

From
Thomas Munro
Date:
On Mon, Aug 3, 2020 at 4:09 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'm inclined to suggest that the LDAP test's test_access could use
> an empty stdin and pass "-c 'SELECT 1'" as a command line option
> instead.  (Maybe that's exactly what you meant, but I'm not sure.)

Good idea.  Here's a patch like that.

> I've not been able to duplicate this locally, so I have no idea if
> that'd really fix it.

Me neither -- I guess someone who enjoys perl could hack IPC::Run to
take a short nap at the right moment.

Attachment

Re: LDAP check flapping on crake due to race

From
Noah Misch
Date:
On Mon, Aug 03, 2020 at 12:12:57PM +1200, Thomas Munro wrote:
> On Mon, Aug 3, 2020 at 4:09 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I'm inclined to suggest that the LDAP test's test_access could use
> > an empty stdin and pass "-c 'SELECT 1'" as a command line option
> > instead.  (Maybe that's exactly what you meant, but I'm not sure.)
> 
> Good idea.  Here's a patch like that.

While I had meant a different approach, this is superior.

> > I've not been able to duplicate this locally, so I have no idea if
> > that'd really fix it.
> 
> Me neither -- I guess someone who enjoys perl could hack IPC::Run to
> take a short nap at the right moment.

Not essential to reproduce first, I think.



Re: LDAP check flapping on crake due to race

From
Thomas Munro
Date:
On Mon, Aug 3, 2020 at 12:29 PM Noah Misch <noah@leadboat.com> wrote:
> On Mon, Aug 03, 2020 at 12:12:57PM +1200, Thomas Munro wrote:
> > On Mon, Aug 3, 2020 at 4:09 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > > I'm inclined to suggest that the LDAP test's test_access could use
> > > an empty stdin and pass "-c 'SELECT 1'" as a command line option
> > > instead.  (Maybe that's exactly what you meant, but I'm not sure.)
> >
> > Good idea.  Here's a patch like that.
>
> While I had meant a different approach, this is superior.

Thanks.  Pushed.