POC: Better infrastructure for automated testing of concurrency issues - Mailing list pgsql-hackers

From Alexander Korotkov
Subject POC: Better infrastructure for automated testing of concurrency issues
Msg-id CAPpHfdtSEOHX8dSk9Qp+Z++i4BGQoffKip6JDWngEA+g7Z-XmQ@mail.gmail.com
List pgsql-hackers
Hackers,

PostgreSQL is a complex multi-process system, and we periodically face complicated concurrency issues. While the postgres community does a great job of investigating and fixing these problems, our ability to reproduce concurrency issues in the source code test suites is limited.

I think we currently have two general ways to reproduce concurrency issues.
1. A text scenario for manual reproduction of the issue, which could involve psql sessions, gdb sessions etc. A couple of examples are [1] and [2]. This method provides reliable reproduction of concurrency issues. But it's hard to automate, because it requires external instrumentation (a debugger), and it's not stable with respect to postgres code changes (that is, the particular line numbers used for breakpoints may change). I think this is why we currently don't have such scenarios among the postgres test suites.
2. Another way is to reproduce the concurrency issue without directly touching the database internals, using pgbench or another way to simulate the workload (see [3] for example). This way is easier to automate, because it doesn't need external instrumentation and it's not so sensitive to source code changes. But at the same time this way is not reliable and is resource-consuming.

In view of the above, I'd like to propose a POC patch, which implements new builtin infrastructure for reproduction of concurrency issues in automated test suites.  The general idea is so-called "stop events", which are special places in the code where execution can be stopped on some condition.  A stop event also exposes a set of parameters, encapsulated into a jsonb value.  The condition over the stop event parameters is defined using the jsonpath language.

The following functions control the behavior:
 * pg_stopevent_set(stopevent_name, jsonpath_condition) - sets the condition for the stop event.  Once the function is executed, all the backends that reach the given stop event with parameters satisfying the given jsonpath condition will be stopped.
 * pg_stopevent_reset(stopevent_name) - resets the stop event.  All the backends previously stopped on the given stop event will continue execution.
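
To make the set/reset semantics above concrete, here is a toy model in plain C. It is only a sketch under loud assumptions and not the patch's implementation: threads in one process stand in for backends, a single integer threshold stands in for the jsonpath condition over jsonb parameters, and every name (ToyStopEvent, toy_stopevent_set, toy_stopevent_reset, toy_stopevent_hit, toy_waiters, toy_backend) is hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model: one stop event, guarded by a mutex. */
typedef struct ToyStopEvent
{
    pthread_mutex_t lock;
    pthread_cond_t  cv;
    bool            is_set;     /* has a condition been set? */
    int             min_blkno;  /* toy condition: stop if blkno >= min_blkno */
    int             waiters;    /* number of threads currently stopped */
} ToyStopEvent;

static ToyStopEvent toy_event = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, false, 0, 0
};

/* Analogue of pg_stopevent_set(): install the condition. */
static void
toy_stopevent_set(int min_blkno)
{
    pthread_mutex_lock(&toy_event.lock);
    toy_event.is_set = true;
    toy_event.min_blkno = min_blkno;
    pthread_mutex_unlock(&toy_event.lock);
}

/* Analogue of pg_stopevent_reset(): clear the condition, wake everyone. */
static void
toy_stopevent_reset(void)
{
    pthread_mutex_lock(&toy_event.lock);
    toy_event.is_set = false;
    pthread_cond_broadcast(&toy_event.cv);
    pthread_mutex_unlock(&toy_event.lock);
}

/* Called where the stop event sits in the code; returns true if it waited. */
static bool
toy_stopevent_hit(int blkno)
{
    bool waited = false;

    pthread_mutex_lock(&toy_event.lock);
    while (toy_event.is_set && blkno >= toy_event.min_blkno)
    {
        if (!waited)
        {
            waited = true;
            toy_event.waiters++;
        }
        pthread_cond_wait(&toy_event.cv, &toy_event.lock);
    }
    if (waited)
        toy_event.waiters--;
    pthread_mutex_unlock(&toy_event.lock);
    return waited;
}

/* How many threads are currently stopped on the event. */
static int
toy_waiters(void)
{
    int n;

    pthread_mutex_lock(&toy_event.lock);
    n = toy_event.waiters;
    pthread_mutex_unlock(&toy_event.lock);
    return n;
}

/* Thread body standing in for a backend; arg points to its block number. */
static void *
toy_backend(void *arg)
{
    assert(arg != NULL);
    /* Return a non-NULL pointer only if this "backend" had to wait. */
    return toy_stopevent_hit(*(int *) arg) ? arg : NULL;
}
```

A test driver would call toy_stopevent_set(), start a thread running toy_backend, wait until toy_waiters() reports it as stopped, perform the racing operation, and then call toy_stopevent_reset() to let the thread continue. A caller whose parameter does not match the condition passes through toy_stopevent_hit() without waiting.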

Evaluation of stop events certainly incurs some CPU overhead.  This is why it's controlled by the enable_stopevents GUC, which is off by default. I expect the overhead with enable_stopevents = off to be unobservable.  Even if it turns out to be observable, we could compile stop events in only when a specific configure option is given.  There is also the trace_stopevents GUC, which traces all the stop events to the log at the debug2 level.

In the code, stop events are defined using the macro STOPEVENT(event_id, params).  The 'params' argument should be a function call, and it's evaluated only if stop events are enabled.  pg_isolation_test_session_is_blocked() takes stop events into account, so stop events are suitable for isolation tests.
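
The "evaluated only if stop events are enabled" property can be illustrated with a small standalone sketch. This is not the macro from the patch: the plain boolean below merely stands in for the enable_stopevents GUC, a string stands in for the jsonb parameters, and build_params, handle_stopevent, scan_page and the event name "ToyScanPage" are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the enable_stopevents GUC. */
static bool enable_stopevents = false;

static int  params_built = 0;    /* how many times params were evaluated */
static int  events_handled = 0;  /* how many stop events were processed */

/* Pretend this is expensive: it builds the parameter description. */
static const char *
build_params(int blkno)
{
    static char buf[64];

    params_built++;
    snprintf(buf, sizeof(buf), "{\"blkno\": %d}", blkno);
    return buf;
}

/* Stand-in for condition checking and waiting; here it only logs. */
static void
handle_stopevent(const char *event_id, const char *params)
{
    assert(event_id != NULL && params != NULL);
    events_handled++;
    printf("stop event %s, params %s\n", event_id, params);
}

/*
 * The params expression appears only inside the if-branch, so the
 * function call passed as 'params' is not evaluated at all when stop
 * events are disabled; the disabled path costs one boolean test.
 */
#define STOPEVENT(event_id, params) \
    do { \
        if (enable_stopevents) \
            handle_stopevent((event_id), (params)); \
    } while (0)

/* Example call site with a stop event placed in it. */
static void
scan_page(int blkno)
{
    STOPEVENT("ToyScanPage", build_params(blkno));
}
```

With enable_stopevents left false, calling scan_page() never invokes build_params(), which is the reason for passing the parameters as a function call rather than as a precomputed value.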

The POC patch comes with a sample isolation test in src/test/isolation/specs/gin-traverse-deleted-pages.spec, which reproduces the issue described in [2] (a GIN scan steps onto a page concurrently deleted by vacuum).

From my point of view, stop events would open great possibilities to improve coverage of concurrency issues.  They allow us to reliably test concurrency issues in both the isolation and TAP test suites.  And such tests don't take extraordinary resources to execute.  The main cost here is maintaining a set of stop events across the codebase.  But I think this cost is justified by much better coverage of concurrency issues.

Feedback is welcome.

Links.
1. https://www.postgresql.org/message-id/4E1DE580.1090905%40enterprisedb.com
2. https://www.postgresql.org/message-id/CAPpHfdvMvsw-NcE5bRS7R1BbvA4BxoDnVVjkXC5W0Czvy9LVrg%40mail.gmail.com
3. https://www.postgresql.org/message-id/BF9B38A4-2BFF-46E8-BA87-A2D00A8047A6%40hintbits.com

------
Regards,
Alexander Korotkov