Thread: POC: Better infrastructure for automated testing of concurrency issues

POC: Better infrastructure for automated testing of concurrency issues

From

Alexander Korotkov

Date:

25 November 2020, 17:10:54

Hackers,

PostgreSQL is a complex multi-process system, and we are periodically faced with complicated concurrency issues. While the postgres community does a great job on investigating and fixing the problems, our ability to reproduce concurrency issues in the source code test suites is limited.

I think we currently have two general ways to reproduce the concurrency issues.
1. A text scenario for manual reproduction of the issue, which could involve psql sessions, gdb sessions etc. A couple of examples are [1] and [2]. This method provides reliable reproduction of concurrency issues. But it's hard to automate, because it requires external instrumentation (a debugger) and it's not stable with respect to postgres code changes (that is, the particular line numbers for breakpoints could change). I think this is why we currently don't have such scenarios among the postgres test suites.
2. Another way is to reproduce the concurrency issue without directly touching the database internals, using pgbench or another way to simulate the workload (see [3] for example). This way is easier to automate, because it doesn't need external instrumentation and it's not so sensitive to source code changes. But at the same time this way is not reliable and is resource-consuming.

In view of the above, I'd like to propose a POC patch, which implements new builtin infrastructure for reproduction of concurrency issues in automated test suites. The general idea is so-called "stop events", which are special places in the code where execution can be stopped on some condition. A stop event also exposes a set of parameters, encapsulated in a jsonb value. The condition over the stop event parameters is defined using the jsonpath language.

The following functions control the behavior:
* pg_stopevent_set(stopevent_name, jsonpath_condition) – sets the condition for the stop event. Once the function is executed, all backends that reach the given stop event with parameters satisfying the given jsonpath condition will be stopped.
* pg_stopevent_reset(stopevent_name) – resets the stop event. All backends previously stopped on the given stop event will continue execution.
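
To make the set/reset semantics concrete, here is a minimal single-process sketch in C. It is not the patch's code: the registry, the function names stopevent_set(), stopevent_reset() and stopevent_should_stop(), the "ScanPage" event name and the blkno parameter are all hypothetical, and a plain C predicate stands in for the jsonpath condition (roughly what a condition such as '$.blkno == 2' would express). Actually blocking and waking backends is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_STOPEVENTS 8

/* Stand-in for a jsonpath condition: a predicate over one event parameter. */
typedef bool (*StopCondition) (int blkno);

typedef struct StopEventSlot
{
    const char *name;           /* NULL marks a free slot */
    StopCondition cond;
} StopEventSlot;

static StopEventSlot slots[MAX_STOPEVENTS];

/* Return the slot registered under "name", or NULL if there is none. */
static StopEventSlot *
find_slot(const char *name)
{
    for (int i = 0; i < MAX_STOPEVENTS; i++)
    {
        if (slots[i].name != NULL && strcmp(slots[i].name, name) == 0)
            return &slots[i];
    }
    return NULL;
}

/*
 * Register a condition for a stop event.  Replacing an existing condition
 * is an assumption of this sketch.  Returns false if the table is full.
 * The name pointer is stored as-is, so it must outlive the registration.
 */
static bool
stopevent_set(const char *name, StopCondition cond)
{
    StopEventSlot *slot;

    assert(name != NULL && cond != NULL);
    slot = find_slot(name);
    for (int i = 0; i < MAX_STOPEVENTS && slot == NULL; i++)
    {
        if (slots[i].name == NULL)
            slot = &slots[i];
    }
    if (slot == NULL)
        return false;
    slot->name = name;
    slot->cond = cond;
    return true;
}

/* Remove the condition; in the real feature this also releases waiters. */
static void
stopevent_reset(const char *name)
{
    StopEventSlot *slot = find_slot(name);

    if (slot != NULL)
    {
        slot->name = NULL;
        slot->cond = NULL;
    }
}

/* Would a backend reaching this event with this parameter have to stop? */
static bool
stopevent_should_stop(const char *name, int blkno)
{
    StopEventSlot *slot = find_slot(name);

    return slot != NULL && slot->cond(blkno);
}

/* Example condition: stop only when the event concerns block 2. */
static bool
blkno_is_two(int blkno)
{
    return blkno == 2;
}
```

With this model, a backend reaching "ScanPage" for block 2 would stop only between stopevent_set("ScanPage", blkno_is_two) and stopevent_reset("ScanPage"); other blocks pass through at all times.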

For sure, evaluation of stop events causes a CPU overhead. This is why it's controlled by the enable_stopevents GUC, which is off by default. I expect the overhead with enable_stopevents = off not to be observable. Even if it is observable, we could enable stop events only via a specific configure parameter. There is also the trace_stopevents GUC, which traces all stop events to the log at debug2 level.

In the code stop events are defined using macro STOPEVENT(event_id, params). The 'params' should be a function call, and it's evaluated only if stop events are enabled. pg_isolation_test_session_is_blocked() takes stop events into account. So, stop events are suitable for isolation tests.
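
The "evaluated only if stop events are enabled" property follows from how a C macro can wrap its argument in a conditional. The sketch below illustrates that idea only; it is not the macro from the patch. The enable_stopevents variable stands in for the GUC, and build_params(), handle_stopevent(), scan_page() and the "ScanPage" event name are hypothetical. A JSON string stands in for the jsonb value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Stand-in for the enable_stopevents GUC (off by default). */
static bool enable_stopevents = false;

/* Counters so the lazy evaluation can be observed. */
static int  params_built = 0;
static int  events_handled = 0;

/* Hypothetical parameter builder; the patch would build a jsonb value. */
static const char *
build_params(int blkno)
{
    static char buf[64];

    params_built++;
    snprintf(buf, sizeof(buf), "{\"blkno\": %d}", blkno);
    return buf;
}

/* Hypothetical handler; the real one would check the condition and wait. */
static void
handle_stopevent(const char *event_id, const char *params)
{
    assert(event_id != NULL && params != NULL);
    events_handled++;
}

/*
 * The params expression appears only inside the "if" body, so the
 * function call it contains is not executed while the flag is off.
 */
#define STOPEVENT(event_id, params) \
    do { \
        if (enable_stopevents) \
            handle_stopevent((event_id), (params)); \
    } while (0)

/* A code path instrumented with a stop event. */
static void
scan_page(int blkno)
{
    STOPEVENT("ScanPage", build_params(blkno));
}
```

Calling scan_page() with the flag off leaves both counters at zero, so the disabled cost is one branch on a global variable. A configure-time switch could go further and define the macro as empty.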

POC patch comes with a sample isolation test in src/test/isolation/specs/gin-traverse-deleted-pages.spec, which reproduces the issue described in [2] (gin scan steps to the page concurrently deleted by vacuum).

From my point of view, stop events would open great possibilities to improve coverage of concurrency issues. They allow us to reliably test concurrency issues in both isolation and tap test suites. And such test suites don't take extraordinary resources for execution. The main cost here is maintaining a set of stop events among the codebase. But I think this cost is justified by much better coverage of concurrency issues.

The feedback is welcome.

Links.
1. https://www.postgresql.org/message-id/4E1DE580.1090905%40enterprisedb.com
2. https://www.postgresql.org/message-id/CAPpHfdvMvsw-NcE5bRS7R1BbvA4BxoDnVVjkXC5W0Czvy9LVrg%40mail.gmail.com
3. https://www.postgresql.org/message-id/BF9B38A4-2BFF-46E8-BA87-A2D00A8047A6%40hintbits.com

------

Regards,
Alexander Korotkov

Attachment

0001-Stopevents-v1.patch

Re: POC: Better infrastructure for automated testing of concurrency issues

From

Alvaro Herrera

Date:

04 December 2020, 21:29:51

On 2020-Nov-25, Alexander Korotkov wrote:

> In the view of above, I'd like to propose a POC patch, which implements new
> builtin infrastructure for reproduction of concurrency issues in automated
> test suites.  The general idea is so-called "stop events", which are
> special places in the code, where the execution could be stopped on some
> condition.  Stop event also exposes a set of parameters, encapsulated into
> jsonb value.  The condition over stop event parameters is defined using
> jsonpath language.

+1 for the idea.  I agree we have a need for something on this area;
there are *many* scenarios currently untested because of the lack of
what you call "stop points".  I don't know if jsonpath is the best way
to implement it, but at least it is readily available and it seems a
decent way to go at it.

Re: POC: Better infrastructure for automated testing of concurrency issues

From

Peter Geoghegan

Date:

04 December 2020, 21:57:13

On Wed, Nov 25, 2020 at 6:11 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> While the postgres community does a great job on investigating and fixing the problems, our ability to reproduce concurrency issues in the source code test suites is limited.

+1. This seems really cool.

> For sure, evaluation of stop events causes a CPU overhead. This is why it's controlled by enable_stopevents GUC, which is off by default. I expect the overhead with enable_stopevents = off shouldn't be observable. Even if it would be observable, we could enable stop events only by specific configure parameter. There is also trace_stopevents GUC, which traces all the stop events to the log with debug2 level.

But why even risk adding noticeable overhead when "enable_stopevents =
off "? Even if it's a very small risk? We can still get most of the
benefit by enabling it only on certain builds and buildfarm animals.
It will be a bit annoying to not have stop events enabled in all
builds, but it avoids the problem of even having to think about the
overhead, now or in the future. I think that that trade-off is a good
one. Even if the performance trade-off is judged perfectly for the
first few tests you add, what are the chances that it will stay that
way as the infrastructure is used in more and more places? What if you
need to add a test to the back branches? Since we don't anticipate any
direct benefit for users (right?), I think that this question is
simple.

I am not arguing for not enabling stop events on standard builds
because the infrastructure isn't useful -- it's *very* useful. Useful
enough that it would be nice to be able to use it extensively without
really thinking about the performance hit each time. I know that I'll
be *far* more likely to use it if I don't have to waste time and
energy on that aspect every single time.

--
Peter Geoghegan

Re: POC: Better infrastructure for automated testing of concurrency issues

From

Alexander Korotkov

Date:

05 December 2020, 00:15:15

On Fri, Dec 4, 2020 at 9:29 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> On 2020-Nov-25, Alexander Korotkov wrote:
> > In the view of above, I'd like to propose a POC patch, which implements new
> > builtin infrastructure for reproduction of concurrency issues in automated
> > test suites.  The general idea is so-called "stop events", which are
> > special places in the code, where the execution could be stopped on some
> > condition.  Stop event also exposes a set of parameters, encapsulated into
> > jsonb value.  The condition over stop event parameters is defined using
> > jsonpath language.
>
> +1 for the idea.  I agree we have a need for something on this area;
> there are *many* scenarios currently untested because of the lack of
> what you call "stop points".  I don't know if jsonpath is the best way
> to implement it, but at least it is readily available and it seems a
> decent way to go at it.

Thank you for your feedback.  I agree with you regarding jsonpath.  My
initial idea was to use the executor expressions.  But executor
expressions require serialization/deserialization, while stop points
need to work cross-database or even with processes not connected to
any database (such as checkpointer, background writer etc).  That
leads to difficulties, while jsonpath appears to be very easy for this
use-case.

------
Regards,
Alexander Korotkov

Re: POC: Better infrastructure for automated testing of concurrency issues

From

Alexander Korotkov

Date:

05 December 2020, 00:20:27

On Fri, Dec 4, 2020 at 9:57 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Wed, Nov 25, 2020 at 6:11 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> > While the postgres community does a great job on investigating and fixing the problems, our ability to reproduce concurrency issues in the source code test suites is limited.
>
> +1. This seems really cool.
>
> > For sure, evaluation of stop events causes a CPU overhead.  This is why it's controlled by enable_stopevents GUC, which is off by default. I expect the overhead with enable_stopevents = off shouldn't be observable.  Even if it would be observable, we could enable stop events only by specific configure parameter.  There is also trace_stopevents GUC, which traces all the stop events to the log with debug2 level.
>
> But why even risk adding noticeable overhead when "enable_stopevents =
> off "? Even if it's a very small risk? We can still get most of the
> benefit by enabling it only on certain builds and buildfarm animals.
> It will be a bit annoying to not have stop events enabled in all
> builds, but it avoids the problem of even having to think about the
> overhead, now or in the future. I think that that trade-off is a good
> one. Even if the performance trade-off is judged perfectly for the
> first few tests you add, what are the chances that it will stay that
> way as the infrastructure is used in more and more places? What if you
> need to add a test to the back branches? Since we don't anticipate any
> direct benefit for users (right?), I think that this question is
> simple.
>
> I am not arguing for not enabling stop events on standard builds
> because the infrastructure isn't useful -- it's *very* useful. Useful
> enough that it would be nice to be able to use it extensively without
> really thinking about the performance hit each time. I know that I'll
> be *far* more likely to use it if I don't have to waste time and
> energy on that aspect every single time.

Thank you for your feedback.  We probably can't think over everything
in advance.  We can start with configure option enabled for developers
and some buildfarm animals.  That causes no risk of overhead in
standard builds.  After some time, we may reconsider to enable stop
events even in standard build if we see they cause no regression.

------
Regards,
Alexander Korotkov

Re: POC: Better infrastructure for automated testing of concurrency issues

From

Peter Geoghegan

Date:

05 December 2020, 01:00:48

On Fri, Dec 4, 2020 at 1:20 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> Thank you for your feedback.  We probably can't think over everything
> in advance.  We can start with configure option enabled for developers
> and some buildfarm animals.  That causes no risk of overhead in
> standard builds.  After some time, we may reconsider to enable stop
> events even in standard build if we see they cause no regression.

I'll start using the configure option for debug builds only as soon as
possible. It will easily work with my existing workflow.

I don't know about anyone else, but for me this is only a very small
inconvenience. Whereas the convenience of not having to think about
the performance impact seems huge.

-- 
Peter Geoghegan

Re: POC: Better infrastructure for automated testing of concurrency issues

From

Craig Ringer

Date:

07 December 2020, 09:09:35

On Wed, 25 Nov 2020 at 22:11, Alexander Korotkov <aekorotkov@gmail.com> wrote:

> Hackers,
>
> PostgreSQL is a complex multi-process system, and we are periodically faced with complicated concurrency issues. While the postgres community does a great job on investigating and fixing the problems, our ability to reproduce concurrency issues in the source code test suites is limited.
>
> I think we currently have two general ways to reproduce the concurrency issues.
> 1. A text scenario for manual reproduction of the issue, which could involve psql sessions, gdb sessions etc. Couple of examples are [1] and [2]. This method provides reliable reproduction of concurrency issues. But it's hard to automate, because it requires external instrumentation (debugger) and it's not stable in terms of postgres code changes (that is particular line numbers for breakpoints could be changed). I think this is why we currently don't have such scenarios among postgres test suites.
> 2. Another way is to reproduce the concurrency issue without directly touching the database internals using pgbench or other way to simulate the workload (see [3] for example). This way is easier to automate, because it doesn't need external instrumentation and it's not so sensitive to source code changes. But at the same time this way is not reliable and is resource-consuming.

Agreed.

For a useful but limited set of cases there's (3) the isolation tester and pg_isolation_regress. But IIRC the patches to teach it to support multiple upstream nodes never got in, so it's essentially useless for any replication related testing.

There's also (4), write a TAP test that uses concurrent psql sessions via IPC::Run. Then play games with heavyweight or advisory lock waits to order events, use instance starts/stops, change ports or connstrings to simulate network issues, use SIGSTOP/SIGCONTs, add src/test/modules extensions that inject faults or provide custom blocking wait functions for the event you want, etc. I've done that more than I'd care to, and I don't want to do it any more than I have to in future.

In some cases I've gone further and written tests that use systemtap in "guru" mode (read/write, with embedded C enabled) to twiddle the memory of the target process(es) when a probe is hit, e.g. to modify a function argument or return value or inject a fault. Not exactly portable or convenient, though very powerful.

> In the view of above, I'd like to propose a POC patch, which implements new builtin infrastructure for reproduction of concurrency issues in automated test suites. The general idea is so-called "stop events", which are special places in the code, where the execution could be stopped on some condition. Stop event also exposes a set of parameters, encapsulated into jsonb value. The condition over stop event parameters is defined using jsonpath language.

The patched PostgreSQL used by 2ndQuadrant internally has a feature called PROBE_POINT()s that is somewhat akin to this. Since it's not a customer facing feature I'm sure I can discuss it here, though I'll need to seek permission before I can show code.

TL;DR: PROBE_POINT()s let you inject ERRORs, sleeps, crashes, and various other behaviour at points in the code marked by name, using GUCs, hooks loaded from test extensions, or even systemtap scripts to control what fires and when. Performance impact is essentially zero when no probes are currently enabled at runtime, so they're fine for cassert builds.

Details:

A PROBE_POINT() is a macro that works as a marker, a bit like a TRACE_POSTGRESQL_.... dtrace macro. But instead of the super lightweight tracepoint that SDT marker points emit, a PROBE_POINT tests an unlikely(probe_points_enabled) flag, and if true, it prepares arguments for the probe handler: A probe name, probe action, sleep duration, and a hit counter.

The default probe action and sleep duration come from GUCs. So your control of the probe is limited to the granularity you can easily manage GUCs at. That's often sufficient.

But if you want finer control for something, there are two ways to achieve it.

After picking the default arguments to the handler, the probe point checks for a hook. If defined, it calls it with the probe point name and pointers to the action and sleep duration values, so the hook function can modify them per probe-point hit. That way you can drive probes from src/test/modules extensions or your own test extensions: the hook gets the probe point name as an argument and the action and sleep duration as out-params, and can also consult any accessible global state, custom GUCs you define in your test extension, etc. That's usually enough to target a probe very specifically but it's a bit of a hassle.
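
As an illustration only (the 2ndQuadrant code had not been shared at this point in the thread), the flag check, GUC-derived defaults and hook described above might be modelled roughly as follows. Every identifier here is an assumption made for the sketch: the ProbeAction enum, probe_default_action, probe_default_sleep_ms, probe_point_hook, probe_point_fire() and the "before_commit" probe name. The hit counter is kept as a simple global rather than passed to the handler, and the chosen action is only recorded, not performed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#if defined(__GNUC__)
#define unlikely(x) __builtin_expect(!!(x), 0)
#else
#define unlikely(x) (x)
#endif

typedef enum ProbeAction
{
    PROBE_NONE,
    PROBE_SLEEP,
    PROBE_ERROR
} ProbeAction;

/* Stand-ins for the GUCs that enable probes and supply the defaults. */
static bool probe_points_enabled = false;
static ProbeAction probe_default_action = PROBE_NONE;
static int  probe_default_sleep_ms = 0;

/* Optional hook: may override action and sleep duration per probe hit. */
typedef void (*probe_point_hook_type) (const char *name,
                                       ProbeAction *action,
                                       int *sleep_ms);
static probe_point_hook_type probe_point_hook = NULL;

/* What the last fired probe decided to do (recorded, not executed). */
static ProbeAction last_action = PROBE_NONE;
static int  last_sleep_ms = 0;
static int  probe_hits = 0;

/* Slow path: start from the defaults, then let the hook adjust them. */
static void
probe_point_fire(const char *name)
{
    ProbeAction action = probe_default_action;
    int         sleep_ms = probe_default_sleep_ms;

    assert(name != NULL);
    if (probe_point_hook != NULL)
        probe_point_hook(name, &action, &sleep_ms);
    probe_hits++;
    last_action = action;
    last_sleep_ms = sleep_ms;
}

/* Fast path: a single predicted-not-taken branch when probes are off. */
#define PROBE_POINT(name) \
    do { \
        if (unlikely(probe_points_enabled)) \
            probe_point_fire(name); \
    } while (0)

/* Example test-extension hook: only "before_commit" gets a 250 ms sleep. */
static void
test_hook(const char *name, ProbeAction *action, int *sleep_ms)
{
    if (strcmp(name, "before_commit") == 0)
    {
        *action = PROBE_SLEEP;
        *sleep_ms = 250;
    }
}

/* A code path carrying a probe point. */
static void
commit_transaction(void)
{
    PROBE_POINT("before_commit");
}
```

In this model a test extension would install test_hook once and then set probe_points_enabled; code paths with other probe names keep the GUC defaults.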

Another option is to use a systemtap script. You can write your code in systemtap with its language. When the systemtap marker for a probe point event fires, decide if it's the one you want and twiddle the target process variables that store the probe action and sleep duration from the systemtap script. I find this much more convenient for day to day testing, but because of systemtap portability challenges I don't find it as useful for writing regression tests for repeat use.

A PROBE_POINT() actually emits dtrace/perf SDT markers if postgres was compiled with --enable-dtrace too, so you can use them with perf, systemtap, bpftrace or whatever for read-only use. Including optional arguments to the probe point. Exactly as if it was a TRACE_POSTGRESQL_foo point, but without needing to hack probes.d for each one.

The PROBE_POINT() implementation can fake signal delivery with signal actions, which has been handy too.

I also have a version of the code that takes arguments to the PROBE_POINT() and passes them to the handler function as a va_list too, with a compile-time-generated array of argument types inferred by C11 _Generic as the first argument. So your handler function can be passed probe-point-specific contextual info like the current xid being committed or whatever. This isn't currently deployed.
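
The _Generic technique mentioned above can be sketched as follows. This is a guess at the general shape, not the undeployed code: ProbeArgType, PROBE_ARG_TYPE(), PROBE_POINT2() and probe_handler() are hypothetical names, only two arguments are supported, and only int, unsigned int and double are recognised.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>

typedef enum ProbeArgType
{
    PROBE_ARG_INT,
    PROBE_ARG_UINT,
    PROBE_ARG_DOUBLE
} ProbeArgType;

/*
 * Map an expression to a type tag at compile time.  There is no default
 * branch, so an unsupported argument type is a compile error.  The
 * controlling expression of _Generic is not evaluated.
 */
#define PROBE_ARG_TYPE(x) _Generic((x), \
        int: PROBE_ARG_INT, \
        unsigned int: PROBE_ARG_UINT, \
        double: PROBE_ARG_DOUBLE)

/* Last values decoded by the handler, kept for inspection. */
static int  seen_nargs;
static int  seen_int;
static unsigned int seen_uint;
static double seen_double;

/* Handler: the type array comes first and tells it how to read the rest. */
static void
probe_handler(const ProbeArgType *types, const char *name, int nargs, ...)
{
    va_list     ap;

    assert(types != NULL && name != NULL);
    va_start(ap, nargs);
    for (int i = 0; i < nargs; i++)
    {
        switch (types[i])
        {
            case PROBE_ARG_INT:
                seen_int = va_arg(ap, int);
                break;
            case PROBE_ARG_UINT:
                seen_uint = va_arg(ap, unsigned int);
                break;
            case PROBE_ARG_DOUBLE:
                seen_double = va_arg(ap, double);
                break;
        }
    }
    va_end(ap);
    seen_nargs = nargs;
}

/* Two-argument probe point: builds the type array from the arguments. */
#define PROBE_POINT2(name, a, b) \
    probe_handler((const ProbeArgType[]) {PROBE_ARG_TYPE(a), PROBE_ARG_TYPE(b)}, \
                  (name), 2, (a), (b))
```

For example, PROBE_POINT2("before_commit", 731u, 4) hands the handler the tags PROBE_ARG_UINT and PROBE_ARG_INT ahead of the two values, so it can decode an xid-like unsigned value and a plain int without a format string.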

The advantage of the PROBE_POINT() approach has been that it's generally very cheap to check whether a probe point should fire, and it's basically free to skip them if there are no probe points enabled right now. If we hashed the probe point names for the initial comparisons it'd be faster still.

I will seek approval to share the relevant code.

> Following functions control behavior –
> * pg_stopevent_set(stopevent_name, jsonpath_condition) – sets condition for the stop event. Once the function is executed, all the backends, which run a given stop event with parameters satisfying the given jsonpath condition, will be stopped.
> * pg_stopevent_reset(stopevent_name) – resets stop events. All the backends previously stopped on a given stop event will continue the execution.

Does that offer any way to affect early startup, late shutdown, servers in warm standby, etc? Or for that matter, any way to manipulate bgworkers and auxprocs or the postmaster itself, things you can't run a query on directly?

Also, based on my experience using PROBE_POINT()s I would suggest that in addition to a stop or start "event", it's desirable to be able to elog(PANIC), elog(ERROR), elog(LOG), and/or sleep() for a certain duration. I've found all to be extremely useful.

> In the code stop events are defined using macro STOPEVENT(event_id, params). The 'params' should be a function call, and it's evaluated only if stop events are enabled. pg_isolation_test_session_is_blocked() takes stop events into account.

Oooh, that I like.

PROBE_POINT()s don't do that, and it's annoying.

Re: POC: Better infrastructure for automated testing of concurrency issues

From

Alexander Korotkov

Date:

07 December 2020, 20:31:21

Hi!

On Mon, Dec 7, 2020 at 9:10 AM Craig Ringer
<craig.ringer@enterprisedb.com> wrote:
> On Wed, 25 Nov 2020 at 22:11, Alexander Korotkov <aekorotkov@gmail.com> wrote:
>> PostgreSQL is a complex multi-process system, and we are periodically faced with complicated concurrency issues. While the postgres community does a great job on investigating and fixing the problems, our ability to reproduce concurrency issues in the source code test suites is limited.
>>
>> I think we currently have two general ways to reproduce the concurrency issues.
>> 1. A text scenario for manual reproduction of the issue, which could involve psql sessions, gdb sessions etc. Couple of examples are [1] and [2]. This method provides reliable reproduction of concurrency issues. But it's hard to automate, because it requires external instrumentation (debugger) and it's not stable in terms of postgres code changes (that is particular line numbers for breakpoints could be changed). I think this is why we currently don't have such scenarios among postgres test suites.
>> 2. Another way is to reproduce the concurrency issue without directly touching the database internals using pgbench or other way to simulate the workload (see [3] for example). This way is easier to automate, because it doesn't need external instrumentation and it's not so sensitive to source code changes. But at the same time this way is not reliable and is resource-consuming.
>
> Agreed.
>
> For a useful but limited set of cases there's (3) the isolation tester and pg_isolation_regress. But IIRC the patches to teach it to support multiple upstream nodes never got in, so it's essentially useless for any replication related testing.
>
> There's also (4), write a TAP test that uses concurrent psql sessions via IPC::Run. Then play games with heavyweight or advisory lock waits to order events, use instance starts/stops, change ports or connstrings to simulate network issues, use SIGSTOP/SIGCONTs, add src/test/modules extensions that inject faults or provide custom blocking wait functions for the event you want, etc. I've done that more than I'd care to, and I don't want to do it any more than I have to in future.

Sure, there are isolation tester and TAP tests.  I just meant the
scenarios, where we can't reliably reproduce using either isolation
tests or tap tests.  Sorry for confusion.

> In some cases I've gone further and written tests that use systemtap in "guru" mode (read/write, with embedded C enabled) to twiddle the memory of the target process(es) when a probe is hit, e.g. to modify a function argument or return value or inject a fault. Not exactly portable or convenient, though very powerful.

Exactly, systemtap is good, but we need something more portable and
convenient for builtin test suites.

>> In the view of above, I'd like to propose a POC patch, which implements new builtin infrastructure for reproduction of concurrency issues in automated test suites.  The general idea is so-called "stop events", which are special places in the code, where the execution could be stopped on some condition.  Stop event also exposes a set of parameters, encapsulated into jsonb value.  The condition over stop event parameters is defined using jsonpath language.
>
>
> The patched PostgreSQL used by 2ndQuadrant internally has a feature called PROBE_POINT()s that is somewhat akin to this. Since it's not a customer facing feature I'm sure I can discuss it here, though I'll need to seek permission before I can show code.
>
> TL;DR: PROBE_POINT()s let you inject ERRORs, sleeps, crashes, and various other behaviour at points in the code marked by name, using GUCs, hooks loaded from test extensions, or even systemtap scripts to control what fires and when. Performance impact is essentially zero when no probes are currently enabled at runtime, so they're fine for cassert builds.
>
> Details:
>
> A PROBE_POINT() is a macro that works as a marker, a bit like a TRACE_POSTGRESQL_.... dtrace macro. But instead of the super lightweight tracepoint that SDT marker points emit, a PROBE_POINT tests an unlikely(probe_points_enabled) flag, and if true, it prepares arguments for the probe handler: A probe name, probe action, sleep duration, and a hit counter.
>
> The default probe action and sleep duration come from GUCs. So your control of the probe is limited to the granularity you can easily manage GUCs at. That's often sufficient
>
> But if you want finer control for something, there are two ways to achieve it.
>
> After picking the default arguments to the handler, the probe point checks for a hook. If defined, it calls it with the probe point name and pointers to the action and sleep duration values, so the hook function can modify them per probe-point hit. That way you can use in src/test/modules extensions or your own test extensions first, with the probe point name as an argument and the action and sleep duration as out-params, as well as any accessible global state, custom GUCs you define in your test extension, etc. That's usually enough to target a probe very specifically but it's a bit of a hassle.
>
> Another option is to use a systemtap script. You can write your code in systemtap with its language. When the systemtap marker for a probe point event fires, decide if it's the one you want and twiddle the target process variables that store the probe action and sleep duration from the systemtap script. I find this much more convenient for day to day testing, but because of systemtap portability challenges I don't find it as useful for writing regression tests for repeat use.
>
> A PROBE_POINT() actually emits dtrace/perf SDT markers if postgres was compiled with --enable-dtrace too, so you can use them with perf, systemtap, bpftrace or whatever for read-only use. Including optional arguments to the probe point. Exactly as if it was a TRACE_POSTGRESQL_foo point, but without needing to hack probes.d for each one.
>
> The PROBE_POINT() implementation can fake signal delivery with signal actions, which has been handy too.
>
> I also have a version of the code that takes arguments to the PROBE_POINT() and passes them to the handler function as a va_list too, with a compile-time-generated array of argument types inferred by C11 _Generic as the first argument. So your handler function can be passed probe-point-specific contextual info like the current xid being committed or whatever. This isn't currently deployed.
>
> The advantage of the PROBE_POINT() approach has been that it's generally very cheap to check whether a probe point should fire, and it's basically free to skip them if there are no probe points enabled right now. If we hashed the probe point names for the initial comparisons it'd be faster still.
>
> I will seek approval to share the relevant code.

It's nice to know that we've also worked in this direction.  I was a
bit surprised when I didn't find relevant patches published in the
mailing lists.  I hope you would be able to share the code, it would
be very nice to see.

>> Following functions control behavior –
>>  * pg_stopevent_set(stopevent_name, jsonpath_condition) – sets condition for the stop event.  Once the function is executed, all the backends, which run a given stop event with parameters satisfying the given jsonpath condition, will be stopped.
>>  * pg_stopevent_reset(stopevent_name) – resets stop events.  All the backends previously stopped on a given stop event will continue the execution.
>
>
> Does that offer any way to affect early startup, late shutdown, servers in warm standby, etc? Or for that matter, any way to manipulate bgworkers and auxprocs or the postmaster itself, things you can't run a query on directly?

Using the current version of patch you can manipulate bgworkers and
auxprocs as soon as they're connected to the shmem.  We can write
queries from another backend and the setting affects the whole
cluster.  I'm planning to add the ability to access the process
information from the jsonpath condition.  So, we would be able to
choose which process to stop on the stop event.

Early startup, late shutdown, servers in warm standby are not
supported yet.  I think this could be done using GUCs and hooks +
custom extensions, in a way similar to what you describe for
PROBE_POINT().

Also, I don't think we need to support everything at once.  It would
be nice to get something simple as soon as we have a clear roadmap of
how to add the rest of the features later.

> Also, based on my experience using PROBE_POINT()s I would suggest that in addition to a stop or start "event", it's desirable to be able to elog(PANIC), elog(ERROR), elog(LOG), and/or sleep() for a certain duration. I've found all to be extremely useful.
>
>> In the code stop events are defined using macro STOPEVENT(event_id, params).  The 'params' should be a function call, and it's evaluated only if stop events are enabled.  pg_isolation_test_session_is_blocked() takes stop events into account.
>
>
> Oooh, that I like.
>
> PROBE_POINT()s don't do that, and it's annoying.

Thank you for your feedback.  I look forward to seeing the
PROBE_POINT() work if you can publish it.

------
Regards,
Alexander Korotkov

Re: POC: Better infrastructure for automated testing of concurrency issues

From

Andrey Borodin

Date:

08 December 2020, 13:26:27

Hi Alexander!

> On 25 Nov 2020, at 19:10, Alexander Korotkov <aekorotkov@gmail.com> wrote:
>
> In the code stop events are defined using macro STOPEVENT(event_id, params).  The 'params' should be a function call, and it's evaluated only if stop events are enabled.  pg_isolation_test_session_is_blocked() takes stop events into account. So, stop events are suitable for isolation tests.

Thanks for this infrastructure. Looks like a really nice way to increase test coverage of most difficult things.

Can we also somehow prove that test was deterministic? I.e. expect number of blocked backends (if known) or something like that.
I'm not really sure it's useful, just an idea.

Thanks!

Best regards, Andrey Borodin.

Re: POC: Better infrastructure for automated testing of concurrency issues

From

Alexander Korotkov

Date:

08 December 2020, 13:41:56

On Tue, Dec 8, 2020 at 1:26 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> > On 25 Nov 2020, at 19:10, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> >
> > In the code stop events are defined using macro STOPEVENT(event_id, params).  The 'params' should be a function call, and it's evaluated only if stop events are enabled.  pg_isolation_test_session_is_blocked() takes stop events into account. So, stop events are suitable for isolation tests.
>
> Thanks for this infrastructure. Looks like a really nice way to increase test coverage of most difficult things.
>
> Can we also somehow prove that test was deterministic? I.e. expect number of blocked backends (if known) or something like that.
> I'm not really sure it's useful, just an idea.

Thank you for your feedback!

I forgot to mention, patch comes with pg_stopevents() function which
returns rowset (stopevent text, condition jsonpath, waiters int[]).
Waiters is an array of pids of waiting processes.

Additionally, isolation tester checks if a particular backend is
waiting using pg_isolation_test_session_is_blocked(), which works with
stop events too.

------
Regards,
Alexander Korotkov

Re: POC: Better infrastructure for automated testing of concurrency issues

From

Alexander Korotkov

Date:

01 September 2022, 03:51:45

Hi!

On Tue, Feb 23, 2021 at 3:09 AM Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Dec 8, 2020 at 2:42 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> > Thank you for your feedback!
>
> It would be nice to use this patch to test things that are important
> but untested inside vacuumlazy.c, such as the rare
> HEAPTUPLE_DEAD/tupgone case (grep for "Ordinarily, DEAD tuples would
> have been removed by..."). Same is true of the closely related
> heap_prepare_freeze_tuple()/heap_tuple_needs_freeze() code.

I'll continue work on this patch.  The rebased patch is attached.  It
implements stop events as configure option (not runtime GUC option).

------
Regards,
Alexander Korotkov

Attachment

0001-Stopevents-v3.patch

Re: POC: Better infrastructure for automated testing of concurrency issues

From

Craig Ringer

Date:

18 October 2022, 04:06:14

On Tue, 23 Feb 2021 at 08:09, Peter Geoghegan <pg@bowt.ie> wrote:

> On Tue, Dec 8, 2020 at 2:42 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> > Thank you for your feedback!
>
> It would be nice to use this patch to test things that are important
> but untested inside vacuumlazy.c, such as the rare
> HEAPTUPLE_DEAD/tupgone case (grep for "Ordinarily, DEAD tuples would
> have been removed by..."). Same is true of the closely related
> heap_prepare_freeze_tuple()/heap_tuple_needs_freeze() code.

That's what the PROBE_POINT()s functionality I referenced is for, too.

The proposed stop events feature has finer grained control over when the events fire than PROBE_POINT()s do. That's probably the main limitation in the PROBE_POINT()s functionality right now - controlling it at runtime is a bit crude unless you opt for using a C test extension or a systemtap script, and both those have other downsides.

On the other hand, PROBE_POINT()s are lighter weight when not actively turned on, to the point where they can be included in production builds to facilitate support and runtime diagnostics. They interoperate very nicely with static tracepoint markers (SDTs), the TRACE_POSTGRESQL_FOO(...) stuff, so there's no need for yet another separate set of debug markers scattered through the code. They can perform a wider set of actions useful for testing and diagnostics - PANIC the current backend, self-deliver an arbitrary signal, force a LOG message, introduce an interruptible or uninterruptible sleep, send a message to the client if any (handy for regress tests), or fire an extension-defined callback function.

I'd like to find a way to get the best of both worlds if possible.

Rather than completely sidetrack the thread on this patch I posted the PROBE_POINT()s patch on a separate thread here.

Re: POC: Better infrastructure for automated testing of concurrency issues

From

"Gregory Stark (as CFM)"

Date:

28 March 2023, 21:44:21

On Wed, 31 Aug 2022 at 20:52, Alexander Korotkov <aekorotkov@gmail.com> wrote:
>
> I'll continue work on this patch.  The rebased patch is attached.  It
> implements stop events as configure option (not runtime GUC option).

It looks like this patch isn't going to be ready this commitfest. And
it hasn't received much discussion since August 2022. If I'm wrong say
something but otherwise I'll mark it Returned With Feedback. It can be
resurrected (and moved to the next commitfest) when you're free to
work on it again.

-- 
Gregory Stark
As Commitfest Manager

Re: POC: Better infrastructure for automated testing of concurrency issues

From

Alexander Korotkov

Date:

28 March 2023, 22:37:48

On Tue, Mar 28, 2023 at 9:44 PM Gregory Stark (as CFM)
<stark.cfm@gmail.com> wrote:
> On Wed, 31 Aug 2022 at 20:52, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> >
> > I'll continue work on this patch.  The rebased patch is attached.  It
> > implements stop events as configure option (not runtime GUC option).
>
> It looks like this patch isn't going to be ready this commitfest. And
> it hasn't received much discussion since August 2022. If I'm wrong say
> something but otherwise I'll mark it Returned With Feedback. It can be
> resurrected (and moved to the next commitfest) when you're free to
> work on it again.

I'm good with that.

------
Regards,
Alexander Korotkov