Thread: Re: Add isolation test template in injection_points for wait/wakeup/detach

Re: Add isolation test template in injection_points for wait/wakeup/detach

From
Bertrand Drouvot
Date:
Hi,

On Fri, Oct 18, 2024 at 07:44:08AM +0900, Michael Paquier wrote:
> Hi all,
> 
> This was on my TODO bucket for some time.  The isolation test
> "inplace" in the isolation test suite for injection_points relies on a
> couple of behaviors implemented in the backend.  One of them is that a
> point detach should not affect a wait.  Rather than having to guess
> this stuff from the sole isolation test we have in this module, I
> would like to add a simple test to document all these assumptions.

> This also works as a simple template for a newcomer willing to
> implement an isolation test with injection points.  That's less
> intimidating than the inplace test with its VACUUM and GRANT
> interactions, based on an injection point run when updating the stats
> of a relation in VACUUM.

> 
> Thoughts?

I think that it makes sense to "expose" the basics in a dedicated test. 

I thought that template.spec could have been another "naming" option but
basic.spec is better I think.

The test is pretty straightforward so I don't have that much to say, just one Nit:

1 ===

+# Detach before wait does not cause a wait.

What about?

Detach before wait does not cause a wait (and so wakeup produces an error)

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Add isolation test template in injection_points for wait/wakeup/detach

From
Bertrand Drouvot
Date:
Hi,

On Fri, Oct 25, 2024 at 09:33:12AM +0900, Michael Paquier wrote:
> This is in the first permutation of the test done with "wait1 wakeup2
> detach2", and the diff means that the backend running the "wait"
> callback is reported as finished after the detach is done,
> injection_points_run being only used for the waits.  Hence the wait is
> so slow to finish that the detach has time to complete and finish,
> breaking the output.

Yeah, I agree with your analysis.

> And here comes the puzzling part: all of failures involve FreeBSD 13
> in the CI.  Reproducing this failure would not be difficult, I thought
> first; we can add a hardcoded pg_usleep() to delay the end of
> injection_wait() so as we make sure that the backend doing the wait
> reports later than the detach.  Or just do the same at the end of
> injection_points_run() once the callback exits.  I've sure done that,
> placing some strategic pg_usleep() calls on Linux to make the paths
> that matter in the wait slower, but the test remains stable.

Right, I did the same and observed the exact same behavior.

Still, it's possible to observe the s1 wait finishing after the s2 detach by
running this test manually (means create 2 sessions and run the test commands
manually) and: 

1. attach a debuger on the first session (say with a break point in injection_wait()).
or
2. add a hardcoded pg_usleep() in injection_wait()

So I think that the s1 wait finishing after the s2 detach is possible if
the session 1 is "freezed" (gdb case) or slow enough (pg_usleep() case).

> The CI
> on Linux is stable as well: 3rd and 4th columns of the cfbot are
> green, I did not spot any failures related to this isolation test in
> injection_points.  Only the second column about FreeBSD is going rogue
> on a periodic basis.
> 
> One fix would is to remove this first permutation test, still that's
> hiding a problem rather than solving it, and something looks wrong
> with conditional variables, specific to FreeBSD?

Hum, we would probably observe other failures in other tests no?

> Any thoughts or comments?

I think that we can not be 100% sure that the s1 wait will finish before the
s2 detach (easy reproducible with gdb attached on s1 or an hardcoded sleep) and
that other OS could also report the test as failing for the same reason.

It's not ideal, but instead of removing this first permutation test what about
adding a "sleep2" step in it (doing say, SELECT pg_sleep(1);) and call this
new step before the detach2 one?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Add isolation test template in injection_points for wait/wakeup/detach

From
Bertrand Drouvot
Date:
Hi,

On Tue, Oct 29, 2024 at 01:12:25PM +0900, Michael Paquier wrote:
> On Mon, Oct 28, 2024 at 07:17:28AM +0000, Bertrand Drouvot wrote:
> > I think that we can not be 100% sure that the s1 wait will finish before the
> > s2 detach (easy reproducible with gdb attached on s1 or an hardcoded sleep) and
> > that other OS could also report the test as failing for the same reason.
> 
> Yes, the only safe thing we can do in this test is to let the wakeup2
> be last, as we are sure that the isolationtester is going to keep at
> s2 to finish the call of the wakeup function before moving on with
> checking the end of the wait.

Agree.

> > It's not ideal, but instead of removing this first permutation test what about
> > adding a "sleep2" step in it (doing say, SELECT pg_sleep(1);) and call this
> > new step before the detach2 one?
> 
> There is no real guarantee of stability.

Right.

> Under a wait of N seconds,
> we could still have environments where the wait() could remain stuck
> more than N seconds between the moment the condition variable is woken
> up and the result of the wait() is reported back to the client.  And
> hardcoded sleeps make the test slower even on fast machines.

Yeah, not ideal so better to keep the test as it is currently (after your
refactoring).

> What we have here seems like just contention of Cirrus with the
> FreeBSD hosts while there is much more stability with the linux hosts.

That sounds like it, agree.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Add isolation test template in injection_points for wait/wakeup/detach

From
Bertrand Drouvot
Date:
Hi,

On Tue, Oct 29, 2024 at 05:20:42PM +0900, Michael Paquier wrote:
> On Tue, Oct 29, 2024 at 07:59:31AM +0000, Bertrand Drouvot wrote:
> > On Tue, Oct 29, 2024 at 01:12:25PM +0900, Michael Paquier wrote:
> >> Under a wait of N seconds,
> >> we could still have environments where the wait() could remain stuck
> >> more than N seconds between the moment the condition variable is woken
> >> up and the result of the wait() is reported back to the client.  And
> >> hardcoded sleeps make the test slower even on fast machines.
> > 
> > Yeah, not ideal so better to keep the test as it is currently (after your
> > refactoring).
> 
> Backpedalling a bit on that..  Perhaps it would be useful to keep the 
> permutation as disabled in the test and to update the comment in
> basic.spec to tell that this had better be avoided when implementing a
> test if the timing of the return value of a SQL wait() cannot be
> strictly controlled?  And also say that putting the wakeup() should be
> last for simple permutations?  The point of this spec is also to
> document some good and bad practices.

Yeah, agree that it would make sense to document in the test what has been
discovered here.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Add isolation test template in injection_points for wait/wakeup/detach

From
Bertrand Drouvot
Date:
Hi,

On Wed, Oct 30, 2024 at 12:40:05PM +0900, Michael Paquier wrote:
> On Tue, Oct 29, 2024 at 02:06:11PM +0000, Bertrand Drouvot wrote:
> > Yeah, agree that it would make sense to document in the test what has been
> > discovered here.
> 
> How about something like the attached?

Thanks!

A few comments:

1 === Nit

s/Permutations like this one/Permutations like the following commented out one/ ?

2 ===

s/back its result/back its result once woken up/ ?

3 ===

s/faster than the wait/before the SQL function doing the wait returns its result/ ?

With 2 === and 3 === I think it's easier to "link" both sentences that way.

Thoughts?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Add isolation test template in injection_points for wait/wakeup/detach

From
Bertrand Drouvot
Date:
Hi,

On Wed, Oct 30, 2024 at 06:21:04PM +0900, Michael Paquier wrote:
> On Wed, Oct 30, 2024 at 07:26:07AM +0000, Bertrand Drouvot wrote:
> > 1 === Nit
> > 
> > s/Permutations like this one/Permutations like the following commented out one/ ?
> > 
> > 2 ===
> > 
> > s/back its result/back its result once woken up/ ?
> > 
> > 3 ===
> > 
> > s/faster than the wait/before the SQL function doing the wait returns its result/ ?
> > 
> > With 2 === and 3 === I think it's easier to "link" both sentences that way.
> 
> What do you think about the updated version attached?

Thanks! LGTM.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com