Re: Windows buildfarm members vs. new async-notify isolation test - Mailing list pgsql-hackers

From Mark Dilger
Subject Re: Windows buildfarm members vs. new async-notify isolation test
Date
Msg-id 362bca0b-1c1c-c760-ab19-b5d9a14c69ea@gmail.com
In response to Re: Windows buildfarm members vs. new async-notify isolation test  (Andrew Dunstan <andrew.dunstan@2ndquadrant.com>)
Responses Re: Windows buildfarm members vs. new async-notify isolation test
List pgsql-hackers

On 12/2/19 11:42 AM, Andrew Dunstan wrote:
> 
> On 12/2/19 11:23 AM, Tom Lane wrote:
>> I see from the buildfarm status page that since commits 6b802cfc7
>> et al went in a week ago, frogmouth and currawong have failed that
>> new test case every time, with the symptom
>>
>> ================== pgsql.build/src/test/isolation/regression.diffs ===================
>> *** c:/prog/bf/root/REL_10_STABLE/pgsql.build/src/test/isolation/expected/async-notify.out    Mon Nov 25 00:30:49 2019
>> --- c:/prog/bf/root/REL_10_STABLE/pgsql.build/src/test/isolation/results/async-notify.out    Mon Dec  2 00:54:26 2019
>> ***************
>> *** 93,99 ****
>>    step llisten: LISTEN c1; LISTEN c2;
>>    step lcommit: COMMIT;
>>    step l2commit: COMMIT;
>> - listener2: NOTIFY "c1" with payload "" from notifier
>>    step l2stop: UNLISTEN *;
>>    
>>    starting permutation: llisten lbegin usage bignotify usage
>> --- 93,98 ----
>>
>> (Note that these two critters don't run branches v11 and up, which
>> is why they're only showing this failure in 10 and 9.6.)
>>
>> drongo showed the same failure once in v10, and fairywren showed
>> it once in v12.  Every other buildfarm animal seems happy.
>>
>> I'm a little baffled as to what this might be --- some sort of
>> timing problem in our Windows signal emulation, perhaps?  But
>> if so, why haven't we found it years ago?
>>
>> I don't have any ability to test this myself, so would appreciate
>> help or ideas.
> 
> 
> 
> I can test things, but I don't really know what to test. FYI frogmouth
> and currawong run on virtualized XP. drongo and fairywren run on
> virtualized WS2019. Neither VM is heavily resourced.

Hi Andrew, if you have time, you could perhaps check the
isolation test structure itself.  Like Tom, I don't have a
Windows box to test this on.

I would be curious to see if there is a race condition in
src/test/isolation/isolationtester.c between the loop starting
on line 820:

   while ((res = PQgetResult(conn)))
   {
      ...
   }

and the attempt to consume input that might include NOTIFY
messages on line 861:

   PQconsumeInput(conn);

If the first loop consumes the commit message, gets no
further PGresult from PQgetResult, and finishes, and execution
proceeds to PQconsumeInput before the NOTIFY has arrived
over the socket, there won't be anything for PQnotifies to
return, and hence for try_complete_step to print before
returning.

I'm not sure if it is possible for the commit message to
arrive before the notify message in the fashion I am describing,
but that's something you might easily check by having
isolationtester sleep before PQconsumeInput on line 861.


-- 
Mark Dilger


