Thread: Adding facility for injection points (or probe points?) for more advanced tests

Adding facility for injection points (or probe points?) for more advanced tests

From
Michael Paquier
Date:
Hi all,

I don't remember how many times in the last few years when I've had to
hack the backend to produce a test case that involves a weird race
condition across multiple processes running in the backend, to be able
to prove a point or just test a fix (one recent case: 2b8e5273e949).
Usually, I come to hardcoding stuff for the following situations:
- Trigger a PANIC, to force recovery.
- A FATAL, to take down a session, or just an ERROR.
- palloc() failure injection.
- Sleep to slow down a code path.
- Pause and release with condition variable.

And, while that's helpful to prove a point on a thread, nothing comes
out of it in terms of regression test coverage in the tree because
these tests are usually too slow and expensive, as they usually rely
on hardcoded timeouts.  So that's pretty much attempting to emulate
what one would do with a debugger in a predictable way, without the
manual steps because human hands don't scale well.

The reason behind that is of course more advanced testing, to be able
to expand coverage when we have weird and non-deterministic race
issues to deal with, and the code becoming more complex every year
makes that even harder.  Fault and failure injection in specific paths
comes into mind, additionally, particularly if you manage complex
projects based on Postgres.

So, please find attached a patch set that introduces an in-core
facility to be able to set what I'm calling here an "injection point",
that consists of being able to register in shared memory a callback
that can be run within a defined location of the code.  It means that
it is not possible to trigger a callback before shared memory is set,
but I've faced far more the case where I wanted to trigger something
after shmem is set anyway.  Persisting an injection point across
restarts is also possible by adding some through an extension's shmem
hook, as we do for custom LWLocks for example, as long as a library is
loaded.

This will remind a bit of what Alexander Korotkov has proposed here:
https://www.postgresql.org/message-id/CAPpHfdtSEOHX8dSk9Qp%2BZ%2B%2Bi4BGQoffKip6JDWngEA%2Bg7Z-XmQ%40mail.gmail.com
Also, this is much closee to what Craig Ringer is mentioning here,
where it is named probe points, but I am using a minimal design that
allows to achieve the same:
https://www.postgresql.org/message-id/CAPpHfdsn-hzneYNbX4qcY5rnwr-BA1ogOCZ4TQCKQAw9qa48kA%40mail.gmail.com

A difference is that I don't really see a point in passing to the
callback triggered an area of data coming from the hash table itself,
as at the end a callback could just refer to an area in shared memory
or a static set of variables depending on what it wants, with one or
more injection points (say a location to set a state, and a second to
check it).  So, at the end, the problem comes down in my opinion to
two things:
- Possibility to trigger a condition defined by some custom code, in
the backend (core code or even out-of-core).
- Possibility to define a location in the code where a named point
would be checked.

0001 introduces three APIs to create, run, and drop injection points:
+extern void InjectionPointCreate(const char *name,
+                    InjectionPointCallback callback);
+extern void InjectionPointRun(const char *name);
+extern void InjectionPointDrop(const char *name);

Then one just needs to add a macro like that to trigger the callback
registered in the code to test:
INJECTION_POINT_RUN("String");
So the footprint in the core tree is not zero, but it is as minimal as
it can be.

I have added some documentation to explain that, as well.  I am not
wedded to the name proposed in the patch, so if you feel there is
better, feel free to propose ideas.

This facility is hidden behind a specific configure/Meson switch,
making it a no-op by default:
--enable-injection-points
-Dinjection_points={ true | false }

0002 is a test module to test these routines, that I have kept a
maximum simple to ease review of the basics proposed here.  This could
be extended further to propose more default modes with TAP tests on
its own, as I don't see a real point in having the SQL bits or some
common callbacks (like for the PANIC or the FATAL cases) in core.

Thoughts and comments are welcome.
--
Michael

Attachment


On Wed, Oct 25, 2023 at 9:43 AM Michael Paquier <michael@paquier.xyz> wrote:
Hi all,

I don't remember how many times in the last few years when I've had to
hack the backend to produce a test case that involves a weird race
condition across multiple processes running in the backend, to be able
to prove a point or just test a fix (one recent case: 2b8e5273e949).
Usually, I come to hardcoding stuff for the following situations:
- Trigger a PANIC, to force recovery.
- A FATAL, to take down a session, or just an ERROR.
- palloc() failure injection.
- Sleep to slow down a code path.
- Pause and release with condition variable.

+1 for the feature.

TWIMW, here[1] is an interesting talk from pgconf.in 2020 on the similar topic.


Regards,
Amul Sul 

Re: Adding facility for injection points (or probe points?) for more advanced tests

From
Michael Paquier
Date:
On Wed, Oct 25, 2023 at 10:06:17AM +0530, Amul Sul wrote:
> +1 for the feature.
>
> TWIMW, here[1] is an interesting talk from pgconf.in 2020 on the similar
> topic.
>
> 1] https://pgconf.in/conferences/pgconfin2020/program/proposals/101

Right, this uses a shared hash table.  There is a patch from 2019 that
summarizes this presentation as well:
https://www.postgresql.org/message-id/CANXE4TdxdESX1jKw48xet-5GvBFVSq%3D4cgNeioTQff372KO45A%40mail.gmail.com

A different idea is that this patch could leverage a bgworker instead
of having a footprint in the postmaster.  FWIW, I think that my patch
is more flexible than the modes added by faultinjector.h (see 0001),
because the actions that can be taken should not be limited by the
core code: the point registered could just use what it wants as
callback, so an extension could register a custom thing as well.
--
Michael

Attachment

Re: Adding facility for injection points (or probe points?) for more advanced tests

From
Nazir Bilal Yavuz
Date:
Hi,

On Wed, 25 Oct 2023 at 07:13, Michael Paquier <michael@paquier.xyz> wrote:
>
> Hi all,
>
> I don't remember how many times in the last few years when I've had to
> hack the backend to produce a test case that involves a weird race
> condition across multiple processes running in the backend, to be able
> to prove a point or just test a fix (one recent case: 2b8e5273e949).
> Usually, I come to hardcoding stuff for the following situations:
> - Trigger a PANIC, to force recovery.
> - A FATAL, to take down a session, or just an ERROR.
> - palloc() failure injection.
> - Sleep to slow down a code path.
> - Pause and release with condition variable.

I liked the idea; thanks for working on this!

What do you think about creating a function for updating the already
created injection point's callback or name (mostly callback)? For now,
you need to drop and recreate the injection point to change the
callback or the name.

Here is my code correctness review:

diff --git a/meson_options.txt b/meson_options.txt
+option('injection_points', type: 'boolean', value: true,
+  description: 'Enable injection points')
+

It is enabled by default while building with meson.


diff --git a/src/backend/utils/misc/injection_point.c
b/src/backend/utils/misc/injection_point.c
+    LWLockRelease(InjectionPointLock);
+
+    /* If not found, do nothing? */
+    if (!found)
+        return;

It would be good to log a warning message here.


I tried to compile that with -Dwerror=true -Dinjection_points=false
and got some errors (warnings):

injection_point.c: In function ‘InjectionPointShmemSize’:
injection_point.c:59:1: error: control reaches end of non-void
function [-Werror=return-type]

injection_point.c: At top level:
injection_point.c:32:14: error: ‘InjectionPointHashByName’ defined but
not used [-Werror=unused-variable]

test_injection_points.c: In function ‘test_injection_points_run’:
test_injection_points.c:69:21: error: unused variable ‘name’
[-Werror=unused-variable]


The test_injection_points test runs and passes although I set
-Dinjection_points=false. That could be misleading, IMO the test
should be skipped if Postgres is not compiled with the injection
points.

Regards,
Nazir Bilal Yavuz
Microsoft



Re: Adding facility for injection points (or probe points?) for more advanced tests

From
Michael Paquier
Date:
On Mon, Nov 06, 2023 at 10:28:14PM +0300, Nazir Bilal Yavuz wrote:
> I liked the idea; thanks for working on this!

Thanks for the review.

> What do you think about creating a function for updating the already
> created injection point's callback or name (mostly callback)? For now,
> you need to drop and recreate the injection point to change the
> callback or the name.

I am not sure if that's worth the addition.  TBH, all the code I've
seen that would benefit from these APIs just set up a cluster,
register a few injection points with a module, and then run a set of
tests.  They can also remove points.  So I'm just aiming for simplest
for the moment.

> Here is my code correctness review:
>
> diff --git a/meson_options.txt b/meson_options.txt
> +option('injection_points', type: 'boolean', value: true,
> +  description: 'Enable injection points')
> +
>
> It is enabled by default while building with meson.

Indeed, fixed.

> diff --git a/src/backend/utils/misc/injection_point.c
> b/src/backend/utils/misc/injection_point.c
> +    LWLockRelease(InjectionPointLock);
> +
> +    /* If not found, do nothing? */
> +    if (!found)
> +        return;
>
> It would be good to log a warning message here.

I don't think that's a good idea.  If a code path defines a
INJECTION_POINT_RUN() we'd get spurious warnings except if a point is
always defined when the build switch is enabled.

> I tried to compile that with -Dwerror=true -Dinjection_points=false
> and got some errors (warnings):

Right, fixed these three.

> The test_injection_points test runs and passes although I set
> -Dinjection_points=false. That could be misleading, IMO the test
> should be skipped if Postgres is not compiled with the injection
> points.

The test suite has been using an alternate output, but perhaps you are
right that this has little value without the switch enabled anyway.
I've made the processing optional when the option is not used for
meson and ./configure (requires a variable in Makefile.global.in in
the latter case), removing the alternate output.

Please find v2.
--
Michael

Attachment

Re: Adding facility for injection points (or probe points?) for more advanced tests

From
Nathan Bossart
Date:
On Tue, Nov 07, 2023 at 05:01:16PM +0900, Michael Paquier wrote:
> On Mon, Nov 06, 2023 at 10:28:14PM +0300, Nazir Bilal Yavuz wrote:
>> I liked the idea; thanks for working on this!

+1, this seems very useful.

> +#ifdef USE_INJECTION_POINTS
> +#define INJECTION_POINT_RUN(name) InjectionPointRun(name)
> +#else
> +#define INJECTION_POINT_RUN(name) ((void) name)
> +#endif

nitpick: Why is the non-injection-point version "(void) name"?  I see
"(void) true" used elsewhere for this purpose.

> +    <para>
> +     Here is an example of callback for
> +     <literal>InjectionPointCallback</literal>:
> +<programlisting>
> +static void
> +custom_injection_callback(const char *name)
> +{
> +    elog(NOTICE, "%s: executed custom callback", name);
> +}
> +</programlisting>

Why do we provide the name to the callback functions?

Overall, the design looks pretty good to me.  I think it's a good idea to
keep it simple to start with.  Since this is really only intended for
special tests that run in special builds, it seems like we ought to be able
to change it easily in the future as needed.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Hi,

On 2023-10-25 13:13:38 +0900, Michael Paquier wrote:
> So, please find attached a patch set that introduces an in-core
> facility to be able to set what I'm calling here an "injection point",
> that consists of being able to register in shared memory a callback
> that can be run within a defined location of the code.  It means that
> it is not possible to trigger a callback before shared memory is set,
> but I've faced far more the case where I wanted to trigger something
> after shmem is set anyway.  Persisting an injection point across
> restarts is also possible by adding some through an extension's shmem
> hook, as we do for custom LWLocks for example, as long as a library is
> loaded.

I would like to see a few example tests using this facility - without that
it's a bit hard to judge how the impact on core code would be and how easy
tests are to write.

It also seems like there's a few bits and pieces missing to actually be able
to write interesting tests. It's one thing to be able to inject code, but what
you commonly want to do for tests is to actually wait for such a spot in the
code to be reached, then perhaps wait inside the "modified" code, and do
something else in the test script. But as-is a decent amount of C code would
need to be written to write such a test, from what I can tell?

Greetings,

Andres Freund



Re: Adding facility for injection points (or probe points?) for more advanced tests

From
Michael Paquier
Date:
On Fri, Nov 10, 2023 at 02:44:25PM -0600, Nathan Bossart wrote:
> On Tue, Nov 07, 2023 at 05:01:16PM +0900, Michael Paquier wrote:
>> +#ifdef USE_INJECTION_POINTS
>> +#define INJECTION_POINT_RUN(name) InjectionPointRun(name)
>> +#else
>> +#define INJECTION_POINT_RUN(name) ((void) name)
>> +#endif
>
> nitpick: Why is the non-injection-point version "(void) name"?  I see
> "(void) true" used elsewhere for this purpose.

Or (void) 0.

>> +    <para>
>> +     Here is an example of callback for
>> +     <literal>InjectionPointCallback</literal>:
>> +<programlisting>
>> +static void
>> +custom_injection_callback(const char *name)
>> +{
>> +    elog(NOTICE, "%s: executed custom callback", name);
>> +}
>> +</programlisting>
>
> Why do we provide the name to the callback functions?

This is for the use of the same callback across multiple points, and
tracking the name of the event happening was making sense to me to
know which code path is being taken when a callback is called.  One
thing that I got in mind as well here is to be able to register custom
wait events based on the name of the callback taken, for example on a
condition variable, a latch or a named LWLock.

> Overall, the design looks pretty good to me.  I think it's a good idea to
> keep it simple to start with.  Since this is really only intended for
> special tests that run in special builds, it seems like we ought to be able
> to change it easily in the future as needed.

Yes, my first idea is to keep the initial design minimal and take the
temperature.  As far as I can see, there seem to not be any strong
objection with this basic design, still I agree that I need to show a
bit more code about its usability.  I have some SQL and recovery cases
where this is handy and these have piled over time, including at least
two/three of them with more basic APIs in the test module may make
sense in the initial batch of what I am proposing here.
--
Michael

Attachment

Re: Adding facility for injection points (or probe points?) for more advanced tests

From
Michael Paquier
Date:
On Fri, Nov 10, 2023 at 06:32:27PM -0800, Andres Freund wrote:
> I would like to see a few example tests using this facility - without that
> it's a bit hard to judge how the impact on core code would be and how easy
> tests are to write.

Sure.  I was wondering if people would be interested in that first.

> It also seems like there's a few bits and pieces missing to actually be able
> to write interesting tests. It's one thing to be able to inject code, but what
> you commonly want to do for tests is to actually wait for such a spot in the
> code to be reached, then perhaps wait inside the "modified" code, and do
> something else in the test script. But as-is a decent amount of C code would
> need to be written to write such a test, from what I can tell?

Depends on what you'd want to achieve.  As I mentioned at the top of
the thread, error, fatal, panics, hardcoded waits are the most common
cases I've seen in the last years.  Conditional waits are not in the
main patch but these are simple to support done (I mean, as in the
0003 attached with a TAP example).

While on it, I have extended the patch in the hash table a library
name and a function name so as the callback is loaded each time an
injection point is run.  (Perhaps the list of callbacks already loaded
in a process should be saved in a session-level static list/array to
avoid loading the same callbacks again, not sure if that's worth doing
for a test facility assuming that the number of times a callback is
called in a single session is usually very limited.  Anyway, that
would be simple to add if people prefer this addition.)

Anyway, here is a short list of commits that could have taken benefit
from this facility.  There are is much more, but that's a list I
grabbed quickly from my notes:
1) 8a4237908c0f
2) cb0cca188072
3) 7863ee4def65 (See https://postgr.es/m/YnT/Y2sEYj7pyOdc@paquier.xyz
where an expensive TAP test was included, and I've seen users facing
this bug in real life).  Revert of the original is clean here as well.
The trick is simple: stop a restartpoint during a promotion, and let
the restartpoint finish after the promotion.
4) 409f9ca44713, where injecting an error would stress the consistency
of the data reset (mentioned an error injected at
https://postgr.es/m/YWZk6nmAzQZS4B/z@paquier.xyz).  This reverts
cleanly even today.
5) b4721f39505b, quite similar (mentioned an error injection exactly
here: https://postgr.es/m/20181011033810.GB23570@paquier.xyz).  This
one requires an error when a transaction is started, something can be
achieved if the error is triggered conditionally (note that hard
failure would prevent the transaction to begin with the initial
snapshot taken in InitPostgres, but the module could just use a static
variable to track that).

Among these, I have implemented two examples on top of the main patch
set in 0002 and 0003: 4) as a TAP test with replication commands and
an error injection, and 3) that relies on a custom wait event and a
conditional variable to make the test posted on the other thread
cheaper, with an injection point waiting for a condition variable in
the middle of a restartpoint in the checkpointer.  I don't mean to
necessarily include all that in the upstream tree, these are just here
for reference first.

3) is the most interesting in this set, for sure.  That was a nasty
problem, and some cheap coverage in the core tree could be really good
for it, so I'd like to propose for commit after more polishing.  The
test of the bug 3) I am referring to takes originally 30~45s to run
and it was unstable as it could timeout.  With an injection point it
takes 1~2s.  Note that test_injection_points gains a wait/wake logic
to be able to use condition variables to wait on the restartpoint of a
promoted standby).  Both tests are not shaped for prime day yet, but
that's enough for a set of examples IMHO to show what can be done.

Does it answer your questions?
--
Michael

Attachment

Re: Adding facility for injection points (or probe points?) for more advanced tests

From
Alvaro Herrera
Date:
Hello,

Good stuff here, I also have a bunch of bugfix commits that ended up not
having a test because of the need for a debugger or other interaction,
so let's move forward.

I think the docs (and the macro/function naming) describe things
backwards.  In my mind, it is INJECTION_POINT_RUN() that creates the
injection point; then InjectionPointCreate() attaches something to it.
So I would rename the macro to just INJECTION_POINT() and the function
to InjectionPointAttach().  This way you're saying "attach function FN
from library L to the injection point P"; where P is an entity that is
being created by the INJECTION_POINT() call in the code.

You named the hash table InjectionPointHashByName, which seems weird.
Is there any *other* way to locate an injection point that is not by
name?

In this patch, injection points are instance-wide (because the hash
table is in shmem).  As soon as you install a callback to one point,
that callback will be fired in every session.  Maybe for some tests this
is OK (and in particular your TAP tests have them attached in one
->safe_psql call and then they hit a completely different session, which
wouldn't work if the attachments were process-local), but maybe one
would want them limited to some specific process.  Maybe give an
optional PID so that if any other process hits that injection point,
nothing happens?

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"La rebeldía es la virtud original del hombre" (Arthur Schopenhauer)



Re: Adding facility for injection points (or probe points?) for more advanced tests

From
Michael Paquier
Date:
On Tue, Nov 14, 2023 at 02:11:50PM +0100, Alvaro Herrera wrote:
> Good stuff here, I also have a bunch of bugfix commits that ended up not
> having a test because of the need for a debugger or other interaction,
> so let's move forward.
>
> I think the docs (and the macro/function naming) describe things
> backwards.  In my mind, it is INJECTION_POINT_RUN() that creates the
> injection point; then InjectionPointCreate() attaches something to it.
> So I would rename the macro to just INJECTION_POINT() and the function
> to InjectionPointAttach().  This way you're saying "attach function FN
> from library L to the injection point P"; where P is an entity that is
> being created by the INJECTION_POINT() call in the code.

Okay.  I am not strongly attached to the terms used by the patch.  The
first WIP I wrote used completely different terms.

> You named the hash table InjectionPointHashByName, which seems weird.
> Is there any *other* way to locate an injection point that is not by
> name?

I am not sure what you mean here.  Names are kind of the most
portable and simplest thing I could think of.  Is there something else
you have in mind that would allow a mapping between a code path and
what should be run?  Perhaps that's useful in some cases, but you were
also thinking about an in-core API where it is possible to retrieve a
list of callbacks based on a library name and/or a function name?  I
didn't see a use for it, but why not.

> In this patch, injection points are instance-wide (because the hash
> table is in shmem).  As soon as you install a callback to one point,
> that callback will be fired in every session.  Maybe for some tests this
> is OK (and in particular your TAP tests have them attached in one
> ->safe_psql call and then they hit a completely different session, which
> wouldn't work if the attachments were process-local), but maybe one
> would want them limited to some specific process.  Maybe give an
> optional PID so that if any other process hits that injection point,
> nothing happens?

Yes, still not something that's required in the core APIs or an
initial batch.  This is something I've seen used and a central place
where the callbacks are registered allows that because the callback is
triggered based on a global state like a MyProcPid or a getpid(), so
it is possible to pass a condition to a callback when it is created
(or attached per your wording), with the condition maintained in a
shmem area that can be part of an extension module that defines the
callbacks (in test_injection_points).  One trick sometimes is to know
the PID beforehand, which may need a second wait point (for example)
to make a test deterministic so as a test script has the time to get
the PID of a running session (bgworkers included) before the process
has time to do anything critical for the scenario tested.

An extra thing is that this design can be extended so as it could be
possible to pass down to the callback execution a private pointer of
data, though that's bound to the code path running the injection
point (not in the initial patch).  Then it's up to the callback to
decide if it needs to do something or not (say, I don't want to run
this callback except if I am manipulating page N in an access method,
etc.).  The conditional complexity is pushed to the injection
callbacks, not the core routines in charge is finding a callback or
attaching/creating one.  I am not sure that it is a good idea to
enforce a specific conditional logic in the backend core code.
--
Michael

Attachment

Re: Adding facility for injection points (or probe points?) for more advanced tests

From
Alvaro Herrera
Date:
On 2023-Nov-15, Michael Paquier wrote:

> On Tue, Nov 14, 2023 at 02:11:50PM +0100, Alvaro Herrera wrote:

> > You named the hash table InjectionPointHashByName, which seems weird.
> > Is there any *other* way to locate an injection point that is not by
> > name?
> 
> I am not sure what you mean here.  Names are kind of the most
> portable and simplest thing I could think of.

Oh, I think you're overthinking what my comment was.  I was saying, just
name it "InjectionPointsHash".  Since there appears to be no room for
another hash table for injection points, then there's no need to specify
that this one is the ByName hash.  I couldn't think of any other way to
organize the injection points either.

> > In this patch, injection points are instance-wide (because the hash
> > table is in shmem).
> 
> Yes, still not something that's required in the core APIs or an
> initial batch.

I agree that we can do the easy thing first and build it up later.  I
just hope we don't get too wedded on the current interface because of
lack of time in the current release that we get stuck with it.

> I am not sure that it is a good idea to enforce a specific conditional
> logic in the backend core code.

Agreed, let's get more experience on what other types of tests people
want to build, and how are things going to interact with each other.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"No necesitamos banderas
 No reconocemos fronteras"                  (Jorge González)



Re: Adding facility for injection points (or probe points?) for more advanced tests

From
Michael Paquier
Date:
On Wed, Nov 15, 2023 at 12:21:40PM +0100, Alvaro Herrera wrote:
> On 2023-Nov-15, Michael Paquier wrote:
> Oh, I think you're overthinking what my comment was.  I was saying, just
> name it "InjectionPointsHash".  Since there appears to be no room for
> another hash table for injection points, then there's no need to specify
> that this one is the ByName hash.  I couldn't think of any other way to
> organize the injection points either.

Aha, OK.  No problem, this was itching me as well but I didn't see an
argument with changing these names, so I've renamed things a bit more.

>> Yes, still not something that's required in the core APIs or an
>> initial batch.
>
> I agree that we can do the easy thing first and build it up later.  I
> just hope we don't get too wedded on the current interface because of
> lack of time in the current release that we get stuck with it.

One thing that I assume we will need with more advanced testing is the
possibility to pass down one (for a structure of data) or more
arguments to the callback a point is attached to.  For that, it is
possible to add more macros, like INJECTION_POINT_1ARG,
INJECTION_POINT_ARG(), etc. that use some (void *) pointers.  It would
be possible to add that even in stable branches, as need arises,
changing the structure of the shmem hash table if required to control
the way a callback is run.

At the end, I suspect that it is one of these things where we'll need
to adapt depending on what people want to do with this stuff.  FWIW, I
can do already quite a bit with this minimalistic design and an
external extension.  Custom wait events are also very handy to monitor
a wait.

>> I am not sure that it is a good idea to enforce a specific conditional
>> logic in the backend core code.
>
> Agreed, let's get more experience on what other types of tests people
> want to build, and how are things going to interact with each other.

I've worked more on polishing the core facility today, with 0001
introducing the backend-side facility.  One thing that I mentioned
lacking is a local cache for processes so as they don't load an
external callback more than once if run.  So I have added this local
cache.  When a point is scanned but not found, a previous cache entry
is cleared if any (this leaks but we don't have a way to unload stuff,
and for testing that's not a big deal).  I've renamed the routines to
use attach and detach as terms, and adopted the name you've suggested
for the macro.  The names around the hash table and its entries have
been changed to what you've suggested.  You are right, that's more
intuitive.

0002 is the test module for the basics, split into its own patch, with
a couple of tests for the local cache part.  0003 and 0004 have been
adjusted with the new SQL functions.  At the end, I'd like to propose
0004 as it's been a PITA for me and I don't want to break this case
again.  It needs more work and can be cheaper.  One more argument in
favor of it is the addition of condition variables to wait and wake
points (perhaps with even more variables?) in the test module.

If there is interest for 0003, I'm OK to work more on it as well, but
that's less important IMV.

Thoughts and comments are welcome, with a v4 series attached.
--
Michael

Attachment

Re: Adding facility for injection points (or probe points?) for more advanced tests

From
Ashutosh Bapat
Date:
On Wed, Oct 25, 2023 at 9:43 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> Hi all,
>
> I don't remember how many times in the last few years when I've had to
> hack the backend to produce a test case that involves a weird race
> condition across multiple processes running in the backend, to be able
> to prove a point or just test a fix (one recent case: 2b8e5273e949).
> Usually, I come to hardcoding stuff for the following situations:
> - Trigger a PANIC, to force recovery.
> - A FATAL, to take down a session, or just an ERROR.
> - palloc() failure injection.
> - Sleep to slow down a code path.
> - Pause and release with condition variable.
>
> And, while that's helpful to prove a point on a thread, nothing comes
> out of it in terms of regression test coverage in the tree because
> these tests are usually too slow and expensive, as they usually rely
> on hardcoded timeouts.  So that's pretty much attempting to emulate
> what one would do with a debugger in a predictable way, without the
> manual steps because human hands don't scale well.
>
> The reason behind that is of course more advanced testing, to be able
> to expand coverage when we have weird and non-deterministic race
> issues to deal with, and the code becoming more complex every year
> makes that even harder.  Fault and failure injection in specific paths
> comes into mind, additionally, particularly if you manage complex
> projects based on Postgres.
>
> So, please find attached a patch set that introduces an in-core
> facility to be able to set what I'm calling here an "injection point",
> that consists of being able to register in shared memory a callback
> that can be run within a defined location of the code.  It means that
> it is not possible to trigger a callback before shared memory is set,
> but I've faced far more the case where I wanted to trigger something
> after shmem is set anyway.  Persisting an injection point across
> restarts is also possible by adding some through an extension's shmem
> hook, as we do for custom LWLocks for example, as long as a library is
> loaded.
>
> This will remind a bit of what Alexander Korotkov has proposed here:
> https://www.postgresql.org/message-id/CAPpHfdtSEOHX8dSk9Qp%2BZ%2B%2Bi4BGQoffKip6JDWngEA%2Bg7Z-XmQ%40mail.gmail.com
> Also, this is much closee to what Craig Ringer is mentioning here,
> where it is named probe points, but I am using a minimal design that
> allows to achieve the same:
> https://www.postgresql.org/message-id/CAPpHfdsn-hzneYNbX4qcY5rnwr-BA1ogOCZ4TQCKQAw9qa48kA%40mail.gmail.com
>
> A difference is that I don't really see a point in passing to the
> callback triggered an area of data coming from the hash table itself,
> as at the end a callback could just refer to an area in shared memory
> or a static set of variables depending on what it wants, with one or
> more injection points (say a location to set a state, and a second to
> check it).  So, at the end, the problem comes down in my opinion to
> two things:
> - Possibility to trigger a condition defined by some custom code, in
> the backend (core code or even out-of-core).
> - Possibility to define a location in the code where a named point
> would be checked.
>
> 0001 introduces three APIs to create, run, and drop injection points:
> +extern void InjectionPointCreate(const char *name,
> +                    InjectionPointCallback callback);
> +extern void InjectionPointRun(const char *name);
> +extern void InjectionPointDrop(const char *name);
>
> Then one just needs to add a macro like that to trigger the callback
> registered in the code to test:
> INJECTION_POINT_RUN("String");
> So the footprint in the core tree is not zero, but it is as minimal as
> it can be.
>
> I have added some documentation to explain that, as well.  I am not
> wedded to the name proposed in the patch, so if you feel there is
> better, feel free to propose ideas.

Actually with Attach and Detach terminology, INJECTION_POINT becomes
the place where we "declare" the injection point. So the documentation
needs to first explain INJECTION_POINT and then explain the other
operations.

>
> This facility is hidden behind a specific configure/Meson switch,
> making it a no-op by default:
> --enable-injection-points
> -Dinjection_points={ true | false }

That's useful, but we will also see demand to enable those in
production (under controlled circumstances). But let the functionality
mature under a separate flag and developer builds before used in
production.

>
> 0002 is a test module to test these routines, that I have kept a
> maximum simple to ease review of the basics proposed here.  This could
> be extended further to propose more default modes with TAP tests on
> its own, as I don't see a real point in having the SQL bits or some
> common callbacks (like for the PANIC or the FATAL cases) in core.
>
> Thoughts and comments are welcome.

I think this is super useful functionality. Some comments here.

+/*
+ * Local cache of injection callbacks already loaded, stored in
+ * TopMemoryContext.
+ */
+typedef struct InjectionPointArrayEntry
+{
+ char name[INJ_NAME_MAXLEN];
+ InjectionPointCallback callback;
+} InjectionPointArrayEntry;
+
+static InjectionPointArrayEntry *InjectionPointCacheArray = NULL;

Initial size of this array is small, but given the size of code in a
given path to be tested, I fear that this may not be sufficient. I
feel it's better to use hash table whose APIs are already available.


+ test_injection_points_attach
+------------------------------
+
+(1 row)
+
+SELECT test_injection_points_run('TestInjectionBooh'); -- nothing

I find it pretty useless to expose that as a SQL API. Also adding it
in tests would make it usable as an example. In this patchset
INJECTION_POINT should be the only way to trigger an injection point.

That also brings another question. The INJECTION_POINT needs to be added in the
core code but can only be used through an extension. I think there should be
some rudimentary albeit raw test in core for this. Extension does some good
things like provides built-in actions when the injection is triggered. So, we
should keep those too.

I am glad that it covers the frequently needed injections error, notice and
wait. If someone adds multiple injection points for wait and all of them are
triggered (in different backends), test_injection_points_wake() would
wake them all. When debugging cascaded functionality it's not easy to
follow sequence add, trigger, wake for multiple injections. We should
associate a conditional variable with the required injection points. Attach
should create conditional variable in the shared memory for that injection
point and detach should remove it. I see this being mentioned in the commit
message, but I think it's something we need in the first version of "wait" to
be useful.

At some point we may want to control how many times an injection is
triggered by using a count. Often, I have wanted an ERROR to be thrown
in a code path once or twice and then stop triggering it. For example
to test WAL sender restart after a certain event. We can't really time
Detach correctly to avoid multiple restarts. A count is useful is such
a case.

+/*
+ * Attach a new injection point.
+ */
+void
+InjectionPointAttach(const char *name,
+ const char *library,
+ const char *function)
+{
+#ifdef USE_INJECTION_POINTS

In a non-injection-point build this function would be compiled and a call to
this function would throw an error. This means that any test we write which
uses injection points can not do so optionally. Tests which can be run with and
without injection points built will reduce duplication. We should define this
function as no-op in non-injection-point build to faciliate such tests.

Those tests need to also install extension. That's another pain point.
So anyone wants to run the tests needs to compile the extension too. I
am wondering whether we should have this functionality in the core
itself somewhere and will be only useful when built with injection.

Many a times it's only a single backend which needs to be subjected to
an injection. For inducing ERROR and NOTICE, many a times it's also
the same backend attached the client session. For WAIT, however we
need a way to inject from some other session. We might be able to use
current signalling mechanism for that (wake sends SIGUSR1 with
reason). Leaving aside WAIT for a moment when the same backend's
behaviour is being controlled, do we really need shared memory and
also affect all the running backends. I see some discussion about
being able to trigger only for a given PID, but when that PID of the
current backend itself shared memory is not required.

--
Best Wishes,
Ashutosh Bapat



Re: Adding facility for injection points (or probe points?) for more advanced tests

From
Michael Paquier
Date:
On Mon, Nov 20, 2023 at 04:53:45PM +0530, Ashutosh Bapat wrote:
> On Wed, Oct 25, 2023 at 9:43 AM Michael Paquier <michael@paquier.xyz> wrote:
>> I have added some documentation to explain that, as well.  I am not
>> wedded to the name proposed in the patch, so if you feel there is
>> better, feel free to propose ideas.
>
> Actually with Attach and Detach terminology, INJECTION_POINT becomes
> the place where we "declare" the injection point. So the documentation
> needs to first explain INJECTION_POINT and then explain the other
> operations.

Sure.

>> This facility is hidden behind a specific configure/Meson switch,
>> making it a no-op by default:
>> --enable-injection-points
>> -Dinjection_points={ true | false }
>
> That's useful, but we will also see demand to enable those in
> production (under controlled circumstances). But let the functionality
> mature under a separate flag and developer builds before used in
> production.

Hmm.  Okay.  I'd still keep that under a compile switch for now
anyway to keep a cleaner separation and avoid issues where it would be
possible to inject code in a production build.  Note that I was
planning to switch one of my buildfarm animals to use it should this
stuff make it into the tree.  And people would be free to use it if
they want.  If in production, that would be a risk, IMO.

> +/*
> + * Local cache of injection callbacks already loaded, stored in
> + * TopMemoryContext.
> + */
> +typedef struct InjectionPointArrayEntry
> +{
> + char name[INJ_NAME_MAXLEN];
> + InjectionPointCallback callback;
> +} InjectionPointArrayEntry;
> +
> +static InjectionPointArrayEntry *InjectionPointCacheArray = NULL;
>
> Initial size of this array is small, but given the size of code in a
> given path to be tested, I fear that this may not be sufficient. I
> feel it's better to use hash table whose APIs are already available.

I've never seen in recent years a need for a given test to use more
than 4~5 points.  But I guess that you've seen more than that wanted
in a prod environment with a fork of Postgres?

I'm OK to switch that to use a hash table in the initial
implementation, even for a low number of entries with points that are
not in hot code paths that should be OK.  At least for my cases
related to testing that's OK.  Am I right to assume that you mean a
HTAB in TopMemoryContext?

> I find it pretty useless to expose that as a SQL API. Also adding it
> in tests would make it usable as an example. In this patchset
> INJECTION_POINT should be the only way to trigger an injection point.

That's useful to test the cache logic while providing simple coverage
for the core API, and that's cheap.  So I'd rather keep this test
module around with these tests.

> That also brings another question. The INJECTION_POINT needs to be added in the
> core code but can only be used through an extension. I think there should be
> some rudimentary albeit raw test in core for this. Extension does some good
> things like provides built-in actions when the injection is triggered. So, we
> should keep those too.

Yeah, I'd like to agree with that but everything I've seen in the
recent years is that every setup finishes by having different
assumptions about what they want to do, so my intention is to trim
down the core of the patch to a bare acceptable minimum and then work
on top of that.

> I am glad that it covers the frequently needed injections error, notice and
> wait. If someone adds multiple injection points for wait and all of them are
> triggered (in different backends), test_injection_points_wake() would
> wake them all. When debugging cascaded functionality it's not easy to
> follow sequence add, trigger, wake for multiple injections. We should
> associate a conditional variable with the required injection points. Attach
> should create conditional variable in the shared memory for that injection
> point and detach should remove it. I see this being mentioned in the commit
> message, but I think it's something we need in the first version of "wait" to
> be useful.

More to the point, I actually disagree with that, because it could be
possible as well that the same condition variable is used across
multiple points.  At the end, IMHO, the central hash table should hold
zero meta-data associated to an injection point like a counter, an
elog, a condition variable, a sleep time, etc. or any combination of
all these ones, and should just know about how to load a callback with
a library path and a routine name.  I understand that this is leaving
the responsibility of what a callback should do down to the individual
developer implementing a callback into their own extension, where they
should be free to have conditions of their own.

Something that I agree would be very useful for the in-core APIs,
depending on the cases, is to be able to push some information to the
callback at runtime to let a callback decide what to do depending on a
running state, including a condition registered when a point was
attached.  See my argument about an _ARG macro that passes down to the
callback a (void *).

> At some point we may want to control how many times an injection is
> triggered by using a count. Often, I have wanted an ERROR to be thrown
> in a code path once or twice and then stop triggering it. For example
> to test WAL sender restart after a certain event. We can't really time
> Detach correctly to avoid multiple restarts. A count is useful is such
> a case.

Yeah.  That's also something that can be achieved outside the shmem
hash table, so this is intentionally not part of InjectionPointHash.

> +/*
> + * Attach a new injection point.
> + */
> +void
> +InjectionPointAttach(const char *name,
> + const char *library,
> + const char *function)
> +{
> +#ifdef USE_INJECTION_POINTS
>
> In a non-injection-point build this function would be compiled and a call to
> this function would throw an error. This means that any test we write which
> uses injection points can not do so optionally. Tests which can be run with and
> without injection points built will reduce duplication. We should define this
> function as no-op in non-injection-point build to faciliate such tests.

The argument goes both ways, I guess.  I'd rather choose a hard
failure to know that what I am doing is not silently ignored, which is
why I made this choice in the patch.

> Those tests need to also install extension. That's another pain point.
> So anyone wants to run the tests needs to compile the extension too. I
> am wondering whether we should have this functionality in the core
> itself somewhere and will be only useful when built with injection.

That's something that may be discussed on top of the backend APIs,
though this comes down to how and what kind of meta-data should be
associated to the central shmem hash table.  Keeping the shmem hash as
small as possible to keep minimal the traces of this code in core is a
design choice that I'd rather not change.

> Many a times it's only a single backend which needs to be subjected to
> an injection. For inducing ERROR and NOTICE, many a times it's also
> the same backend attached the client session.

Yep.  I've used that across multiple sessions.  For the basic
facility, I think that's the absolute minimum.

> For WAIT, however we
> need a way to inject from some other session.

You can do that already with the patch, no?  If you know that a
different session would cross a given path, you could set a macro in
it.  If you wish for this session to wait before that, it is possible
to use a second point to make it do so.  I've used such techniques as
well for more complex reproducible failures than what I've posted in
the patch series.  In the last months, I've topped a TAP test to rely
on 5 deterministic points, I think.  Or perhaps 6.  That was a fun
exercise, for a TAP test coded while self-complaining about the core
backend code that does not make this stuff easier.

> We might be able to use
> current signalling mechanism for that (wake sends SIGUSR1 with
> reason). Leaving aside WAIT for a moment when the same backend's
> behaviour is being controlled, do we really need shared memory and
> also affect all the running backends. I see some discussion about
> being able to trigger only for a given PID, but when that PID of the
> current backend itself shared memory is not required.

I am not convinced that there is any need for signalling in most
cases, as long as you know beforehand the PID of the session you'd
like to stop, because this would still require a second session to
register a condition based on the PID known.  Another possibility that
I can think of is to use a custom wait event with a second point to
setup a different condition.  At the end, my point is that it is
possible to control everything in some extension code that holds the
callbacks, with an extra shmem area in the extension that associates
some meta-data with a point name, for instance.
--
Michael

Attachment

Re: Adding facility for injection points (or probe points?) for more advanced tests

From
Ashutosh Bapat
Date:
On Tue, Nov 21, 2023 at 6:56 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> >> This facility is hidden behind a specific configure/Meson switch,
> >> making it a no-op by default:
> >> --enable-injection-points
> >> -Dinjection_points={ true | false }
> >
> > That's useful, but we will also see demand to enable those in
> > production (under controlled circumstances). But let the functionality
> > mature under a separate flag and developer builds before used in
> > production.
>
> Hmm.  Okay.  I'd still keep that under a compile switch for now
> anyway to keep a cleaner separation and avoid issues where it would be
> possible to inject code in a production build.  Note that I was
> planning to switch one of my buildfarm animals to use it should this
> stuff make it into the tree.  And people would be free to use it if
> they want.  If in production, that would be a risk, IMO.

makes sense. Just to be clear - by "in production" I mean user
installations - they may be testing environments or production
environments.

>
> > +/*
> > + * Local cache of injection callbacks already loaded, stored in
> > + * TopMemoryContext.
> > + */
> > +typedef struct InjectionPointArrayEntry
> > +{
> > + char name[INJ_NAME_MAXLEN];
> > + InjectionPointCallback callback;
> > +} InjectionPointArrayEntry;
> > +
> > +static InjectionPointArrayEntry *InjectionPointCacheArray = NULL;
> >
> > Initial size of this array is small, but given the size of code in a
> > given path to be tested, I fear that this may not be sufficient. I
> > feel it's better to use hash table whose APIs are already available.
>
> I've never seen in recent years a need for a given test to use more
> than 4~5 points.  But I guess that you've seen more than that wanted
> in a prod environment with a fork of Postgres?

A given case may not require more than 4 -5 points but users may
create scripts with many frequently required injection points and
install handlers for them.

>
> I'm OK to switch that to use a hash table in the initial
> implementation, even for a low number of entries with points that are
> not in hot code paths that should be OK.  At least for my cases
> related to testing that's OK.  Am I right to assume that you mean a
> HTAB in TopMemoryContext?

Yes.

>
> > I am glad that it covers the frequently needed injections error, notice and
> > wait. If someone adds multiple injection points for wait and all of them are
> > triggered (in different backends), test_injection_points_wake() would
> > wake them all. When debugging cascaded functionality it's not easy to
> > follow sequence add, trigger, wake for multiple injections. We should
> > associate a conditional variable with the required injection points. Attach
> > should create conditional variable in the shared memory for that injection
> > point and detach should remove it. I see this being mentioned in the commit
> > message, but I think it's something we need in the first version of "wait" to
> > be useful.
>
> More to the point, I actually disagree with that, because it could be
> possible as well that the same condition variable is used across
> multiple points.  At the end, IMHO, the central hash table should hold
> zero meta-data associated to an injection point like a counter, an
> elog, a condition variable, a sleep time, etc. or any combination of
> all these ones, and should just know about how to load a callback with
> a library path and a routine name.  I understand that this is leaving
> the responsibility of what a callback should do down to the individual
> developer implementing a callback into their own extension, where they
> should be free to have conditions of their own.
>
> Something that I agree would be very useful for the in-core APIs,
> depending on the cases, is to be able to push some information to the
> callback at runtime to let a callback decide what to do depending on a
> running state, including a condition registered when a point was
> attached.  See my argument about an _ARG macro that passes down to the
> callback a (void *).

The injection_run function is called from the place where the
injection point is declared but that place does not know what
injection function is going to be run. So a user can not pass
arguments to injection declaration. What injection to run is decided
by the injection_attach and thus one can pass arguments to it but then
injection_attach stores the information in the shared memory from
where it's picked up by injection_run. So even though you don't want
to store the arguments in the shared memory, you are creating a design
which takes us towards that direction eventually - otherwise users
will end up writing many injection functions - one for each possible
combination of count, sleep, conditional variable etc. But let's see
whether that happens to be the case in practice. We will need to
evolve this feature based on usage.

>
> > Those tests need to also install extension. That's another pain point.
> > So anyone wants to run the tests needs to compile the extension too. I
> > am wondering whether we should have this functionality in the core
> > itself somewhere and will be only useful when built with injection.
>
> That's something that may be discussed on top of the backend APIs,
> though this comes down to how and what kind of meta-data should be
> associated to the central shmem hash table.  Keeping the shmem hash as
> small as possible to keep minimal the traces of this code in core is a
> design choice that I'd rather not change.

shmem hash size won't depend upon the number of functions we write in
the core. Yes it will add to the core code and may add maintenance
burden. So I understand your inclination to keep the core minimal.

>
> > Many a times it's only a single backend which needs to be subjected to
> > an injection. For inducing ERROR and NOTICE, many a times it's also
> > the same backend attached the client session.
>
> Yep.  I've used that across multiple sessions.  For the basic
> facility, I think that's the absolute minimum.
>
> > For WAIT, however we
> > need a way to inject from some other session.
>
> You can do that already with the patch, no?  If you know that a
> different session would cross a given path, you could set a macro in
> it.  If you wish for this session to wait before that, it is possible
> to use a second point to make it do so.  I've used such techniques as
> well for more complex reproducible failures than what I've posted in
> the patch series.  In the last months, I've topped a TAP test to rely
> on 5 deterministic points, I think.  Or perhaps 6.  That was a fun
> exercise, for a TAP test coded while self-complaining about the core
> backend code that does not make this stuff easier.
>
> > We might be able to use
> > current signalling mechanism for that (wake sends SIGUSR1 with
> > reason). Leaving aside WAIT for a moment when the same backend's
> > behaviour is being controlled, do we really need shared memory and
> > also affect all the running backends. I see some discussion about
> > being able to trigger only for a given PID, but when that PID of the
> > current backend itself shared memory is not required.
>
> I am not convinced that there is any need for signalling in most
> cases, as long as you know beforehand the PID of the session you'd
> like to stop, because this would still require a second session to
> register a condition based on the PID known.  Another possibility that
> I can think of is to use a custom wait event with a second point to
> setup a different condition.  At the end, my point is that it is
> possible to control everything in some extension code that holds the
> callbacks, with an extra shmem area in the extension that associates
> some meta-data with a point name, for instance.

If the session which attaches to an injection point is same as the
session where the injection point is triggered (most of the ERROR and
NOTICE injections will see this pattern), we don't need shared memory.
There's a performance penalty to it since injection_run will look up
shared memory. For WAIT we may or may not need shared memory. But
let's see what other think and what usage patterns we see.


--
Best Wishes,
Ashutosh Bapat



Re: Adding facility for injection points (or probe points?) for more advanced tests

From
Michael Paquier
Date:
On Wed, Nov 22, 2023 at 09:23:21PM +0530, Ashutosh Bapat wrote:
> On Tue, Nov 21, 2023 at 6:56 AM Michael Paquier <michael@paquier.xyz> wrote:
>> I've never seen in recent years a need for a given test to use more
>> than 4~5 points.  But I guess that you've seen more than that wanted
>> in a prod environment with a fork of Postgres?
>
> A given case may not require more than 4 -5 points but users may
> create scripts with many frequently required injection points and
> install handlers for them.

Sure, if a callback is generic enough it could be shared across
multiple points.

> The injection_run function is called from the place where the
> injection point is declared but that place does not know what
> injection function is going to be run. So a user can not pass
> arguments to injection declaration.

It is possible to make that predictible but it means that a callback
is most likely to be used by one single point.  This makes extensions
in charge of holding the callbacks more complicated, at the benefit of
keeping a minimal footprint in the backend code.

> What injection to run is decided
> by the injection_attach and thus one can pass arguments to it but then
> injection_attach stores the information in the shared memory from
> where it's picked up by injection_run. So even though you don't want
> to store the arguments in the shared memory, you are creating a design
> which takes us towards that direction eventually - otherwise users
> will end up writing many injection functions - one for each possible
> combination of count, sleep, conditional variable etc. But let's see
> whether that happens to be the case in practice. We will need to
> evolve this feature based on usage.

A one-one mapping between callback and point is not always necessary.
If you wish to use a combination of N points with a sleep callback and
different sleep times, one can just register a second shmem area in
the extension holding the callbacks that links the point names with
the sleep time to use.

> shmem hash size won't depend upon the number of functions we write in
> the core. Yes it will add to the core code and may add maintenance
> burden. So I understand your inclination to keep the core minimal.

Yeah, without a clear benefit, my point is just to throw the
responsibility to extension developers for now, which would mean the
addition of tests that depend on test_injection_points/, or just
install this extension optionally in other code path that need it.
Maybe 0004 should be in src/test/recovery/ and do that, actually..
I'll most likely agree with extending all the backend stuff in a more
meaningful way, but I am not sure which method should be enforced.

> If the session which attaches to an injection point is same as the
> session where the injection point is triggered (most of the ERROR and
> NOTICE injections will see this pattern), we don't need shared memory.
> There's a performance penalty to it since injection_run will look up
> shared memory. For WAIT we may or may not need shared memory. But
> let's see what other think and what usage patterns we see.

The first POC of the patch that you can find at the top of this thread
did that, actually, but this is too limited.  IMO, linking things to a
central table is just *much* more useful.

I've implemented a v5 that switches the cache to use a seconf hash
table on TopMemoryContext for the cache instead of an array.  This
makes the cache handling slightly cleaner, so your suggestion was
right.  0003 and 0004 are the same as previously, passing or failing
under the same conditions.  I'm wondering if folks have other comments
about 0001 and 0002?  It sounds to me like the consensus is that this
stuff is useful and that there are no string objections, so feel free
to comment.

I don't want to propose 0003 in the tree, just an improved version of
0004 for the test coverage (still need to improve that).
--
Michael

Attachment

Re: Adding facility for injection points (or probe points?) for more advanced tests

From
Ashutosh Bapat
Date:
On Fri, Nov 24, 2023 at 7:26 AM Michael Paquier <michael@paquier.xyz> wrote:
> If you wish to use a combination of N points with a sleep callback and
> different sleep times, one can just register a second shmem area in
> the extension holding the callbacks that links the point names with
> the sleep time to use.
>

Interesting idea. For that the callback needs to know the injection
point name. At least we should pass that to the callback. It's trivial
thing to do.

> > shmem hash size won't depend upon the number of functions we write in
> > the core. Yes it will add to the core code and may add maintenance
> > burden. So I understand your inclination to keep the core minimal.
>
> Yeah, without a clear benefit, my point is just to throw the
> responsibility to extension developers for now, which would mean the
> addition of tests that depend on test_injection_points/, or just
> install this extension optionally in other code path that need it.
> Maybe 0004 should be in src/test/recovery/ and do that, actually..

That might work, but in order to run tests in that directory one has
to also install the extension. Do we have precedence for such kind of
dependency?

>
> > If the session which attaches to an injection point is same as the
> > session where the injection point is triggered (most of the ERROR and
> > NOTICE injections will see this pattern), we don't need shared memory.
> > There's a performance penalty to it since injection_run will look up
> > shared memory. For WAIT we may or may not need shared memory. But
> > let's see what other think and what usage patterns we see.
>
> The first POC of the patch that you can find at the top of this thread
> did that, actually, but this is too limited.  IMO, linking things to a
> central table is just *much* more useful.
>
> I've implemented a v5 that switches the cache to use a seconf hash
> table on TopMemoryContext for the cache instead of an array.  This
> makes the cache handling slightly cleaner, so your suggestion was
> right.

glad that you liked the outcome.

> 0003 and 0004 are the same as previously, passing or failing
> under the same conditions.  I'm wondering if folks have other comments
> about 0001 and 0002?  It sounds to me like the consensus is that this
> stuff is useful

I think so.

> and that there are no string objections, so feel free
> to comment.

Let's get some more opinions on the design. I will review the detailed
code then.

>
> I don't want to propose 0003 in the tree, just an improved version of
> 0004 for the test coverage (still need to improve that).

Are you working on v6 already?

--
Best Wishes,
Ashutosh Bapat



Re: Adding facility for injection points (or probe points?) for more advanced tests

From
Michael Paquier
Date:
On Fri, Nov 24, 2023 at 04:37:58PM +0530, Ashutosh Bapat wrote:
> Interesting idea. For that the callback needs to know the injection
> point name. At least we should pass that to the callback. It's trivial
> thing to do.

This is what's done from the beginning, as well as of 0001 in the v5
series:
+INJECTION_POINT(name);
[...]
+   injection_callback(name);

> That might work, but in order to run tests in that directory one has
> to also install the extension. Do we have precedence for such kind of
> dependency?

Yes, please see EXTRA_INSTALL in some of the Makefiles.  This can
install stuff from paths different than the location where the tests
are run.

>> and that there are no string objections, so feel free
>> to comment.
>
> Let's get some more opinions on the design. I will review the detailed
> code then.

Sure.  Thanks.

>> I don't want to propose 0003 in the tree, just an improved version of
>> 0004 for the test coverage (still need to improve that).
>
> Are you working on v6 already?

No, what would be the point at this stage?  I dont have much more to
add to 0001 and 0002 at the moment, which focus on the core of the
problem.
--
Michael

Attachment

Re: Adding facility for injection points (or probe points?) for more advanced tests

From
Ashutosh Bapat
Date:
On Fri, Nov 24, 2023 at 7:37 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Fri, Nov 24, 2023 at 04:37:58PM +0530, Ashutosh Bapat wrote:
> > Interesting idea. For that the callback needs to know the injection
> > point name. At least we should pass that to the callback. It's trivial
> > thing to do.
>
> This is what's done from the beginning, as well as of 0001 in the v5
> series:
> +INJECTION_POINT(name);
> [...]
> +   injection_callback(name);

In my first look I missed the actual call to the injection callback in
InjectionPointRun()
injection_callback(name);

Sorry for that.

The way I see it is that an extension using this functionality will
create an auxiliary lookup table keyed by the injection point name to
obtain the injection point specific arguments (sleep time, count etc.)
in the shared memory or local memory. Every time an injection callback
is called it will consult this look up table to get the arguments.
That looks ok to me. There might be other ways to achieve the same
effect. We will learn and absorb whatever benefits core and the users.
I like that.

>
> > That might work, but in order to run tests in that directory one has
> > to also install the extension. Do we have precedence for such kind of
> > dependency?
>
> Yes, please see EXTRA_INSTALL in some of the Makefiles.  This can
> install stuff from paths different than the location where the tests
> are run.

WFM then.

>
> >> and that there are no string objections, so feel free
> >> to comment.
> >
> > Let's get some more opinions on the design. I will review the detailed
> > code then.
>
> Sure.  Thanks.
>
> >> I don't want to propose 0003 in the tree, just an improved version of
> >> 0004 for the test coverage (still need to improve that).
> >
> > Are you working on v6 already?
>
> No, what would be the point at this stage?  I dont have much more to
> add to 0001 and 0002 at the moment, which focus on the core of the
> problem.

Since you wroten "(still need to improve ...) I thought you are
working on v6. No problem. Sorry for the confusion.

--
Best Wishes,
Ashutosh Bapat



Re: Adding facility for injection points (or probe points?) for more advanced tests

From
Michael Paquier
Date:
On Mon, Nov 27, 2023 at 12:14:05PM +0530, Ashutosh Bapat wrote:
> Since you wroten "(still need to improve ...) I thought you are
> working on v6. No problem. Sorry for the confusion.

I see why my previous message could be confusing.  Sorry about that.
--
Michael

Attachment
On Tue, Nov 28, 2023 at 4:07 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Mon, Nov 27, 2023 at 12:14:05PM +0530, Ashutosh Bapat wrote:
> > Since you wroten "(still need to improve ...) I thought you are
> > working on v6. No problem. Sorry for the confusion.
>
> I see why my previous message could be confusing.  Sorry about that.

I haven't specifically done a review or testing of this patch, but I
have used this for testing the CLOG group update code with my
SLRU-specific changes and I found it quite helpful to test some of the
concurrent areas where you need to stop processing somewhere in the
middle of the code and testing that area without this kind of
injection point framework is really difficult or may not be even
possible.  We wanted to test the case of clog group update where we
can get multiple processes added to a single group and get the xid
status updated by the group leader, you can refer to my test in that
thread[1] (the last patch test_group_commit.patch is using this
framework for testing).  Overall I feel this framework is quite useful
and easy to use as well.

[1] https://www.postgresql.org/message-id/CAFiTN-udSTGG_t5n9Z3eBbb4_%3DzNoKU%2B8FP-S6zpv-r4Gm-Y%2BQ%40mail.gmail.com


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Adding facility for injection points (or probe points?) for more advanced tests

From
Michael Paquier
Date:
On Mon, Dec 11, 2023 at 11:09:45AM +0530, Dilip Kumar wrote:
> I haven't specifically done a review or testing of this patch, but I
> have used this for testing the CLOG group update code with my
> SLRU-specific changes and I found it quite helpful to test some of the
> concurrent areas where you need to stop processing somewhere in the
> middle of the code and testing that area without this kind of
> injection point framework is really difficult or may not be even
> possible.  We wanted to test the case of clog group update where we
> can get multiple processes added to a single group and get the xid
> status updated by the group leader, you can refer to my test in that
> thread[1] (the last patch test_group_commit.patch is using this
> framework for testing).

Could you be more specific?  test_group_commit.patch includes this
line but there is nothing specific about this injection point getting
used in a test or a callback assigned to it:
./test_group_commit.patch:+    INJECTION_POINT("ClogGroupCommit");

> Overall I feel this framework is quite useful
> and easy to use as well.

Cool, thanks.
--
Michael

Attachment
On Mon, Dec 11, 2023 at 3:14 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Mon, Dec 11, 2023 at 11:09:45AM +0530, Dilip Kumar wrote:
> > I haven't specifically done a review or testing of this patch, but I
> > have used this for testing the CLOG group update code with my
> > SLRU-specific changes and I found it quite helpful to test some of the
> > concurrent areas where you need to stop processing somewhere in the
> > middle of the code and testing that area without this kind of
> > injection point framework is really difficult or may not be even
> > possible.  We wanted to test the case of clog group update where we
> > can get multiple processes added to a single group and get the xid
> > status updated by the group leader, you can refer to my test in that
> > thread[1] (the last patch test_group_commit.patch is using this
> > framework for testing).
>
> Could you be more specific?  test_group_commit.patch includes this
> line but there is nothing specific about this injection point getting
> used in a test or a callback assigned to it:
> ./test_group_commit.patch:+     INJECTION_POINT("ClogGroupCommit");

Oops, I only included the code changes where I am adding injection
points and some comments to verify that, but missed the actual test
file. Attaching it here.

Note:  I think the latest patches are conflicting with the head, can you rebase?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Adding facility for injection points (or probe points?) for more advanced tests

From
Michael Paquier
Date:
On Tue, Dec 12, 2023 at 10:27:09AM +0530, Dilip Kumar wrote:
> Oops, I only included the code changes where I am adding injection
> points and some comments to verify that, but missed the actual test
> file. Attaching it here.

I see.  Interesting that this requires persistent connections to work.
That's something I've found clunky to rely on when the scenarios a
test needs to deal with are rather complex.  That's an area that could
be made easier to use outside of this patch..  Something got proposed
by Andrew Dunstan to make the libpq routines usable through a perl
module, for example.

> Note:  I think the latest patches are conflicting with the head, can you rebase?

Indeed, as per the recent manipulations in ipci.c for the shmem
initialization areas.  Here goes a v6.
--
Michael

Attachment
On Tue, Dec 12, 2023 at 4:15 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Tue, Dec 12, 2023 at 10:27:09AM +0530, Dilip Kumar wrote:
> > Oops, I only included the code changes where I am adding injection
> > points and some comments to verify that, but missed the actual test
> > file. Attaching it here.
>
> I see.  Interesting that this requires persistent connections to work.
> That's something I've found clunky to rely on when the scenarios a
> test needs to deal with are rather complex.  That's an area that could
> be made easier to use outside of this patch..  Something got proposed
> by Andrew Dunstan to make the libpq routines usable through a perl
> module, for example.
>
> > Note:  I think the latest patches are conflicting with the head, can you rebase?
>
> Indeed, as per the recent manipulations in ipci.c for the shmem
> initialization areas.  Here goes a v6.

Sorry for replying late here. Another minor conflict has risen again.
It's minor enough to be ignored for a review.

On Tue, Nov 21, 2023 at 6:56 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Mon, Nov 20, 2023 at 04:53:45PM +0530, Ashutosh Bapat wrote:
> > On Wed, Oct 25, 2023 at 9:43 AM Michael Paquier <michael@paquier.xyz> wrote:
> >> I have added some documentation to explain that, as well.  I am not
> >> wedded to the name proposed in the patch, so if you feel there is
> >> better, feel free to propose ideas.
> >
> > Actually with Attach and Detach terminology, INJECTION_POINT becomes
> > the place where we "declare" the injection point. So the documentation
> > needs to first explain INJECTION_POINT and then explain the other
> > operations.
>
> Sure.

This discussion has not been addressed in v6. I think the interface
needs to be documented in the order below
INJECTION_POINT - this declares an injection point - i.e. a place in
code where an external code can be injected (and run).
InjectionPointAttach() - this is used to associate ("attach") external
code to an injection point.
InjectionPointDetach() - this is used to disassociate ("detach")
external code from an injection point.

Specifying that InjectionPointAttach() "defines" an injection point
gives an impression that the injection point will be "somehow" added
to the code by calling InjectionPointAttach() which is not true. For
InjectionPointAttach() to be useful, the first argument to it should
be something already "declared" in the code using INJECTION_POINT().
Hence INJECTION_POINT needs to be mentioned in the documentation
first, followed by Attach and detach. The documentation needs to be
rephrased to use terms "declare", "attach" and "detach" instead of
"define", "run". The first set is aligned with the functionality
whereas the second set is aligned with the implementation.

Even if an INJECTION_POINT is not "declared" attach would succeed but
doesn't do anything. I think this needs to be made clear in the
documentation. Better if we could somehow make Attach() fail if the
specified injection point is not "declared" using INJECTION_POINT. Of
course we don't want to bloat the hash table with all "declared"
injection points even if they aren't being attached to and hence not
used. I think, exposing the current injection point strings as
#defines and encouraging users to use these macros instead of string
literals will be a good start.

With the current implementation it's possible to "declare" injection
point with same name at multiple places. It's useful but is it
intended?

/* Field sizes */
#define INJ_NAME_MAXLEN 64
#define INJ_LIB_MAXLEN 128
#define INJ_FUNC_MAXLEN 128
I think these limits should be either documented or specified in the
error messages for users to fix their code in case of
errors/unexpected behaviour.

Here are some code level comments on 0001

+typedef struct InjectionPointArrayEntry

This is not an array entry anymore. I think we should rename
InjectionPointEntry as SharedInjectionPointEntry and InjectionPointArrayEntry
as LocalInjectionPointEntry.

+/* utilities to handle the local array cache */
+static void
+injection_point_cache_add(const char *name,
+ InjectionPointCallback callback)
+{
... snip ...
+
+ entry = (InjectionPointCacheEntry *)
+ hash_search(InjectionPointCache, name, HASH_ENTER, &found);
+
+ if (!found)

The function is called only when the injection point is not found in the local
cache. So this condition will always be true. An Assert will help to make it
clear and also prevent an unintended callback replacement.

+#ifdef USE_INJECTION_POINTS
+static bool
+file_exists(const char *name)

There's similar function in jit.c and dfmgr.c. Can we not reuse that code?

+ /* Save the entry */
+ memcpy(entry_by_name->name, name, sizeof(entry_by_name->name));
+ entry_by_name->name[INJ_NAME_MAXLEN - 1] = '\0';
+ memcpy(entry_by_name->library, library, sizeof(entry_by_name->library));
+ entry_by_name->library[INJ_LIB_MAXLEN - 1] = '\0';
+ memcpy(entry_by_name->function, function, sizeof(entry_by_name->function));
+ entry_by_name->function[INJ_FUNC_MAXLEN - 1] = '\0';

Most of the code is using strncpy instead of memcpy. Why is this code different?

+ injection_callback = injection_point_cache_get(name);
+ if (injection_callback == NULL)
+ {
+ char path[MAXPGPATH];
+
+ /* Found, so just run the callback registered */

The condition indicates that the callback was not found. Comment looks wrong.

+ snprintf(path, MAXPGPATH, "%s/%s%s", pkglib_path,
+ entry_by_name->library, DLSUFFIX);
+
+ if (!file_exists(path))
+ elog(ERROR, "could not find injection library \"%s\"", path);
+
+ injection_callback = (InjectionPointCallback)
+ load_external_function(path, entry_by_name->function, true, NULL);
+
+ /* add it to the local cache when found */
+ injection_point_cache_add(name, injection_callback);
+ }
+

Consider case

Backend 2
InjectionPointAttach("xyz", "abc", "pqr");

Backend 1
INJECTION_POINT("xyz");

Backend 2
InjectionPointDetach("xyz");
InjectionPointAttach("xyz", "uvw", "lmn");

Backend 1
INJECTION_POINT("xyz");

IIUC, the last INJECTION_POINT would run abc.pqr instead of uvw.lmn.
Am I correct?

To fix this, we have to a. either save qualified name of the function in local
cache too OR resolve the function name every time INJECTION_POINT is invoked
and is found in the shared hash table. The first one option is cheaper I think.
But it will be good if we can invalidate the local entry when the global entry
changes. To keep code simple, we may choose to ignore close race conditions
where INJECTION_POINT is run while InjectionPointAttach or InjectionPointDetach
is happening. But this way we don't have to look up shared hash table every
time INJECTION_POINT is invoked thus improving performance.

I will look at 0002 next.

--
Best Wishes,
Ashutosh Bapat



I'd like to spend some more time reviewing this one, but here are a couple
of comments.

On Tue, Dec 12, 2023 at 11:44:57AM +0100, Michael Paquier wrote:
> +/*
> + * Allocate shmem space for dynamic shared hash.
> + */
> +void
> +InjectionPointShmemInit(void)
> +{
> +#ifdef USE_INJECTION_POINTS
> +    HASHCTL        info;
> +
> +    /* key is a NULL-terminated string */
> +    info.keysize = sizeof(char[INJ_NAME_MAXLEN]);
> +    info.entrysize = sizeof(InjectionPointEntry);
> +    InjectionPointHash = ShmemInitHash("InjectionPoint hash",
> +                                       INJECTION_POINT_HASH_INIT_SIZE,
> +                                       INJECTION_POINT_HASH_MAX_SIZE,
> +                                       &info,
> +                                       HASH_ELEM | HASH_STRINGS);
> +#endif
> +}

Should we specify HASH_FIXED_SIZE, too?  This hash table will be in the
main shared memory segment and therefore won't be able to expand too far
beyond the declared maximum size.

> +    /*
> +     * Check if the callback exists in the local cache, to avoid unnecessary
> +     * external loads.
> +     */
> +    injection_callback = injection_point_cache_get(name);
> +    if (injection_callback == NULL)
> +    {
> +        char        path[MAXPGPATH];
> +
> +        /* Found, so just run the callback registered */
> +        snprintf(path, MAXPGPATH, "%s/%s%s", pkglib_path,
> +                 entry_by_name->library, DLSUFFIX);
> +
> +        if (!file_exists(path))
> +            elog(ERROR, "could not find injection library \"%s\"", path);
> +
> +        injection_callback = (InjectionPointCallback)
> +            load_external_function(path, entry_by_name->function, true, NULL);
> +
> +        /* add it to the local cache when found */
> +        injection_point_cache_add(name, injection_callback);
> +    }

I'm wondering how important it is to cache the callbacks locally.
load_external_function() won't reload an already-loaded library, so AFAICT
this is ultimately just saving a call to dlsym().

> +     <literal>name</literal> is the name of the injection point, that
> +     will execute the <literal>function</literal> loaded from
> +     <literal>library</library>.
> +     Injection points are saved in a hash table in shared memory, and
> +     last until the server is shut down.
> +    </para>

I think </library> is supposed to be </literal> here.

> +++ b/src/test/modules/test_injection_points/t/002_invalid_checkpoint_after_promote.pl

0003 and 0004 add tests to the test_injection_points module.  Is the idea
that we'd add any tests that required injection points here?  I think it'd
be better if we could move the tests closer to the logic they're testing,
but perhaps that is difficult because you also need to define the callback
functions somewhere.  Hm...

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



On Tue, Jan 02, 2024 at 11:14:56PM -0600, Nathan Bossart wrote:
> Should we specify HASH_FIXED_SIZE, too?  This hash table will be in the
> main shared memory segment and therefore won't be able to expand too far
> beyond the declared maximum size.

Good point.

> I'm wondering how important it is to cache the callbacks locally.
> load_external_function() won't reload an already-loaded library, so AFAICT
> this is ultimately just saving a call to dlsym().

This keeps a copy to a callback under the same address space, and I
guess that it would matter if the code where a callback is added gets
very hot because this means less function pointers.  At the end I
would keep the cache as the code to handle it is neither complex nor
long, while being isolated in its own paths.

> I think </library> is supposed to be </literal> here.

Okay.

> 0003 and 0004 add tests to the test_injection_points module.  Is the idea
> that we'd add any tests that required injection points here?  I think it'd
> be better if we could move the tests closer to the logic they're testing,
> but perhaps that is difficult because you also need to define the callback
> functions somewhere.  Hm...

Yeah.  Agreed that the final result should not have these tests in the
module test_injection_points.  What I was thinking here is to move
002_invalid_checkpoint_after_promote.pl to src/test/recovery/ and pull
the module with the callbacks with an EXTRA_INSTALL.
--
Michael

Attachment
On Thu, Jan 4, 2024 at 5:23 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> > 0003 and 0004 add tests to the test_injection_points module.  Is the idea
> > that we'd add any tests that required injection points here?  I think it'd
> > be better if we could move the tests closer to the logic they're testing,
> > but perhaps that is difficult because you also need to define the callback
> > functions somewhere.  Hm...
>
> Yeah.  Agreed that the final result should not have these tests in the
> module test_injection_points.  What I was thinking here is to move
> 002_invalid_checkpoint_after_promote.pl to src/test/recovery/ and pull
> the module with the callbacks with an EXTRA_INSTALL.

0003 and 0004 are using the extension in this module for some serious
testing. The name of the extension test_injection_point indicates that
it's for testing injection points and not for some serious use of
injection callbacks it adds. Changes 0003 and 0004 suggest otherwise.
I suggest we move test_injection_points from src/test/modules to
contrib/ and rename it as "injection_points". The test files may still
be named as test_injection_point. The TAP tests in 0003 and 0004 once
moved to their appropriate places, will load injection_point extension
and use it. That way predefined injection point callbacks will also be
available for others to use.

--
Best Wishes,
Ashutosh Bapat



On Tue, Jan 2, 2024 at 3:36 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:

>
> I will look at 0002 next.

One more comment on 0001
InjectionPointAttach() doesn't test whether the given function exists
in the given library. Even if InjectionPointAttach() succeeds,
INJECTION_POINT might throw error because the function doesn't exist.
This can be seen as an unwanted behaviour. I think
InjectionPointAttach() should test whether the function exists and
possibly load it as well by adding it to the local cache.

0002 comments
--- /dev/null
+++ b/src/test/modules/test_injection_points/expected/test_injection_points.out

When built without injection point support, this test fails. We should
add an alternate output file for such a build so that the behaviour
with and without injection point support is tested. Or set up things
such that the test is not run under make check in that directory. I
will prefer the first option.

+
+SELECT test_injection_points_run('TestInjectionError'); -- error
+ERROR: error triggered for injection point TestInjectionError
+-- Re-load and run again.

What's getting Re-loaded here? \c will create a new connection and
thus a new backend. Maybe the comment should say "test in a fresh
backend" or something of that sort?

+
+SELECT test_injection_points_run('TestInjectionError'); -- error
+ERROR: error triggered for injection point TestInjectionError
+-- Remove one entry and check the other one.

Looks confusing to me, we are testing the one removed as well. Am I
missing something?

+(1 row)
+
+-- All entries removed, nothing happens

We aren't removing all entries TestInjectionLog2 is still there. Am I
missing something?

0003 looks mostly OK.

0004 comments

+
+# after recovery, the server will not start, and log PANIC: could not
locate a valid checkpoint record

IIUC the comment describes the behaviour with 7863ee4def65 reverted.
But the test after this comment is written for the behaviour with
7863ee4def65. That's confusing. Is the intent to describe both
behaviours in the comment?

+
+ /* And sleep.. */
+ ConditionVariablePrepareToSleep(&inj_state->wait_point);
+ ConditionVariableSleep(&inj_state->wait_point, test_injection_wait_event);
+ ConditionVariableCancelSleep();

According to the prologue of ConditionVariableSleep(), that function
should be called in a loop checking for the desired condition. All the
callers that I examined follow that pattern. I think we need to follow
that pattern here as well.

Below comment from ConditionVariableTimedSleep() makes me think that
the caller of ConditionVariableSleep() can be woken up even if the
condition variable was not signaled. That's why the while() loop
around ConditionVariableSleep().

* If we're still in the wait list, then the latch must have been set
* by something other than ConditionVariableSignal; though we don't
* guarantee not to return spuriously, we'll avoid this obvious case.
*/.

That's all I have for now.

--
Best Wishes,
Ashutosh Bapat



On Thu, Jan 04, 2024 at 08:53:11AM +0900, Michael Paquier wrote:
> On Tue, Jan 02, 2024 at 11:14:56PM -0600, Nathan Bossart wrote:
>> I'm wondering how important it is to cache the callbacks locally.
>> load_external_function() won't reload an already-loaded library, so AFAICT
>> this is ultimately just saving a call to dlsym().
> 
> This keeps a copy to a callback under the same address space, and I
> guess that it would matter if the code where a callback is added gets
> very hot because this means less function pointers.  At the end I
> would keep the cache as the code to handle it is neither complex nor
> long, while being isolated in its own paths.

Fair enough.

>> 0003 and 0004 add tests to the test_injection_points module.  Is the idea
>> that we'd add any tests that required injection points here?  I think it'd
>> be better if we could move the tests closer to the logic they're testing,
>> but perhaps that is difficult because you also need to define the callback
>> functions somewhere.  Hm...
> 
> Yeah.  Agreed that the final result should not have these tests in the
> module test_injection_points.  What I was thinking here is to move
> 002_invalid_checkpoint_after_promote.pl to src/test/recovery/ and pull
> the module with the callbacks with an EXTRA_INSTALL.

+1

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



On Thu, Jan 04, 2024 at 06:04:20PM +0530, Ashutosh Bapat wrote:
> 0003 and 0004 are using the extension in this module for some serious
> testing. The name of the extension test_injection_point indicates that
> it's for testing injection points and not for some serious use of
> injection callbacks it adds. Changes 0003 and 0004 suggest otherwise.

Yeah, I think test_injection_point should be reserved for testing the
injection point machinery.

> I suggest we move test_injection_points from src/test/modules to
> contrib/ and rename it as "injection_points". The test files may still
> be named as test_injection_point. The TAP tests in 0003 and 0004 once
> moved to their appropriate places, will load injection_point extension
> and use it. That way predefined injection point callbacks will also be
> available for others to use.

Rather than defining a module somewhere that tests would need to load,
should we just put the common callbacks in the core server?  Unless there's
a strong reason to define them elsewhere, that could be a nice way to save
a step in the tests.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



On Thu, Jan 04, 2024 at 04:31:02PM -0600, Nathan Bossart wrote:
> On Thu, Jan 04, 2024 at 06:04:20PM +0530, Ashutosh Bapat wrote:
>> 0003 and 0004 are using the extension in this module for some serious
>> testing. The name of the extension test_injection_point indicates that
>> it's for testing injection points and not for some serious use of
>> injection callbacks it adds. Changes 0003 and 0004 suggest otherwise.
>
> Yeah, I think test_injection_point should be reserved for testing the
> injection point machinery.

Sure.  FWIW, it makes sense to me to keep the SQL interface and the
callbacks in the module, per the reasons below.

>> I suggest we move test_injection_points from src/test/modules to
>> contrib/ and rename it as "injection_points". The test files may still
>> be named as test_injection_point. The TAP tests in 0003 and 0004 once
>> moved to their appropriate places, will load injection_point extension
>> and use it. That way predefined injection point callbacks will also be
>> available for others to use.
>
> Rather than defining a module somewhere that tests would need to load,
> should we just put the common callbacks in the core server?  Unless there's
> a strong reason to define them elsewhere, that could be a nice way to save
> a step in the tests.

Nah, having some pre-existing callbacks existing in the backend is
against the original minimalistic design spirit.  These would also
require an SQL interface, and the interface design also depends on the
functions registering them when pushing down custom conditions.
Pushing that down to extensions to do what they want will lead to less
noise, particularly if you consider that we will most likely want to
tweak the callback interfaces for backpatched bugs.  That's also why I
think contrib/ is not a good idea, src/test/modules/ serving the
actual testing purpose here.
--
Michael

Attachment
On Fri, Jan 5, 2024 at 5:08 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> >> I suggest we move test_injection_points from src/test/modules to
> >> contrib/ and rename it as "injection_points". The test files may still
> >> be named as test_injection_point. The TAP tests in 0003 and 0004 once
> >> moved to their appropriate places, will load injection_point extension
> >> and use it. That way predefined injection point callbacks will also be
> >> available for others to use.
> >
> > Rather than defining a module somewhere that tests would need to load,
> > should we just put the common callbacks in the core server?  Unless there's
> > a strong reason to define them elsewhere, that could be a nice way to save
> > a step in the tests.
>
> Nah, having some pre-existing callbacks existing in the backend is
> against the original minimalistic design spirit.  These would also
> require an SQL interface, and the interface design also depends on the
> functions registering them when pushing down custom conditions.
> Pushing that down to extensions to do what they want will lead to less
> noise, particularly if you consider that we will most likely want to
> tweak the callback interfaces for backpatched bugs.  That's also why I
> think contrib/ is not a good idea, src/test/modules/ serving the
> actual testing purpose here.

Well, you have already showed that the SQL interface created for the
test module is being used for testing a core feature. The tests for
that should stay somewhere near the other tests for those features.
Using an extension named "test_injection_point" and which resides in a
test module for testing core features doesn't look great. Hence
suggestion to move it to contrib.

--
Best Wishes,
Ashutosh Bapat



On Fri, Jan 05, 2024 at 12:41:33PM +0530, Ashutosh Bapat wrote:
> Well, you have already showed that the SQL interface created for the
> test module is being used for testing a core feature. The tests for
> that should stay somewhere near the other tests for those features.
> Using an extension named "test_injection_point" and which resides in a
> test module for testing core features doesn't look great. Hence
> suggestion to move it to contrib.

I mean why?  We test a bunch of stuff in src/test/modules/, and this
is not intended to be released to the outside world.

Putting that in contrib/ has a lot of extra cost.  One is
documentation and more complexity regarding versioning when it comes
to upgrading it to a new version.  I don't think that it is a good
idea to deal with this extra load of work for something that I'd aim
to be used for having improved *test* coverage, and the build switch
should stay.  Saying that, I'd be OK with renaming the module to
injection_points, but I will fight hard about keeping that in
src/test/modules/.  That's less maintenance headache to think about
when having to deal with complex racy bugs.
--
Michael

Attachment
On Fri, Jan 5, 2024 at 12:49 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> I mean why?  We test a bunch of stuff in src/test/modules/, and this
> is not intended to be released to the outside world.
>
> Putting that in contrib/ has a lot of extra cost.  One is
> documentation and more complexity regarding versioning when it comes
> to upgrading it to a new version.  I don't think that it is a good
> idea to deal with this extra load of work for something that I'd aim
> to be used for having improved *test* coverage, and the build switch
> should stay.  Saying that, I'd be OK with renaming the module to
> injection_points,

Ok. Thanks.

> but I will fight hard about keeping that in
> src/test/modules/.  That's less maintenance headache to think about
> when having to deal with complex racy bugs.

For me getting this feature in code in a usable manner is more
important than its place in the code. I have no plans to fight over
it. :).

--
Best Wishes,
Ashutosh Bapat



On Tue, Dec 12, 2023 at 4:15 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Tue, Dec 12, 2023 at 10:27:09AM +0530, Dilip Kumar wrote:
> > Oops, I only included the code changes where I am adding injection
> > points and some comments to verify that, but missed the actual test
> > file. Attaching it here.
>
> I see.  Interesting that this requires persistent connections to work.
> That's something I've found clunky to rely on when the scenarios a
> test needs to deal with are rather complex.  That's an area that could
> be made easier to use outside of this patch..  Something got proposed
> by Andrew Dunstan to make the libpq routines usable through a perl
> module, for example.
>
> > Note:  I think the latest patches are conflicting with the head, can you rebase?
>
> Indeed, as per the recent manipulations in ipci.c for the shmem
> initialization areas.  Here goes a v6.

Some comments in 0001, mostly cosmetics

1.
+/* utilities to handle the local array cache */
+static void
+injection_point_cache_add(const char *name,
+   InjectionPointCallback callback)

I think the comment for this function should be more specific about
adding an entry to the local injection_point_cache_add.  And add
comments for other functions as well e.g. injection_point_cache_get


2.
+typedef struct InjectionPointEntry
+{
+ char name[INJ_NAME_MAXLEN]; /* hash key */
+ char library[INJ_LIB_MAXLEN]; /* library */
+ char function[INJ_FUNC_MAXLEN]; /* function */
+} InjectionPointEntry;

Some comments would be good for the structure

3.

+static bool
+file_exists(const char *name)
+{
+ struct stat st;
+
+ Assert(name != NULL);
+ if (stat(name, &st) == 0)
+ return !S_ISDIR(st.st_mode);
+ else if (!(errno == ENOENT || errno == ENOTDIR))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not access file \"%s\": %m", name)));
+ return false;
+}

dfmgr.c has a similar function so can't we reuse it by making that
function external?

4.
+ if (found)
+ {
+ LWLockRelease(InjectionPointLock);
+ elog(ERROR, "injection point \"%s\" already defined", name);
+ }
+
...
+#else
+ elog(ERROR, "Injection points are not supported by this build");

Better to use similar formatting for error output, Injection vs
injection (better not to capitalize the first letter for consistency
pov)

5.
+ * Check first the shared hash table, and adapt the local cache
+ * depending on that as it could be possible that an entry to run
+ * has been removed.
+ */

What if the entry is removed after we have released the
InjectionPointLock? Or this would not cause any harm?


0004:

I think
test_injection_points_wake() and test_injection_wait() can be moved as
part of 0002



--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



On Fri, Jan 05, 2024 at 08:38:22AM +0900, Michael Paquier wrote:
> On Thu, Jan 04, 2024 at 04:31:02PM -0600, Nathan Bossart wrote:
>> Rather than defining a module somewhere that tests would need to load,
>> should we just put the common callbacks in the core server?  Unless there's
>> a strong reason to define them elsewhere, that could be a nice way to save
>> a step in the tests.
> 
> Nah, having some pre-existing callbacks existing in the backend is
> against the original minimalistic design spirit.  These would also
> require an SQL interface, and the interface design also depends on the
> functions registering them when pushing down custom conditions.
> Pushing that down to extensions to do what they want will lead to less
> noise, particularly if you consider that we will most likely want to
> tweak the callback interfaces for backpatched bugs.  That's also why I
> think contrib/ is not a good idea, src/test/modules/ serving the
> actual testing purpose here.

Ah, so IIUC we'd have to put some functions in pg_proc.dat even though they
would only be used for a handful of tests in special builds.  I'd agree
that's not desirable.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



On Fri, Jan 05, 2024 at 04:18:49PM +0900, Michael Paquier wrote:
> Putting that in contrib/ has a lot of extra cost.  One is
> documentation and more complexity regarding versioning when it comes
> to upgrading it to a new version.  I don't think that it is a good
> idea to deal with this extra load of work for something that I'd aim
> to be used for having improved *test* coverage, and the build switch
> should stay.

+1

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



On Fri, Jan 05, 2024 at 10:28:47AM -0600, Nathan Bossart wrote:
> +1

Extra note for this thread: it is possible to add a SQL test case for
problems like what's been reported on this thread when facing a
partial write failure:
https://www.postgresql.org/message-id/18259-6e256429825dd435@postgresql.org
--
Michael

Attachment
On Fri, Jan 05, 2024 at 03:00:25PM +0530, Dilip Kumar wrote:
> Some comments in 0001, mostly cosmetics
>
> 1.
> +/* utilities to handle the local array cache */
> +static void
> +injection_point_cache_add(const char *name,
> +   InjectionPointCallback callback)
>
> I think the comment for this function should be more specific about
> adding an entry to the local injection_point_cache_add.  And add
> comments for other functions as well e.g. injection_point_cache_get

And it is not an array anymore.  Note InjectionPointArrayEntry that
still existed.

> 2.
> +typedef struct InjectionPointEntry
> +{
> + char name[INJ_NAME_MAXLEN]; /* hash key */
> + char library[INJ_LIB_MAXLEN]; /* library */
> + char function[INJ_FUNC_MAXLEN]; /* function */
> +} InjectionPointEntry;
>
> Some comments would be good for the structure

Sure.  I've spent more time documenting things in injection_point.c,
addressing any inconsistencies.

> 3.
>
> +static bool
> +file_exists(const char *name)
> +{
> + struct stat st;
> +
> + Assert(name != NULL);
> + if (stat(name, &st) == 0)
> + return !S_ISDIR(st.st_mode);
> + else if (!(errno == ENOENT || errno == ENOTDIR))
> + ereport(ERROR,
> + (errcode_for_file_access(),
> + errmsg("could not access file \"%s\": %m", name)));
> + return false;
> +}
>
> dfmgr.c has a similar function so can't we reuse it by making that
> function external?

Yes.  Note that jit.c has an extra copy of it.  I was holding on the
refactoring, but let's bite the bullet and have a single routine.
I've moved that into a 0001 that builds on top of the rest.

> 4.
> + if (found)
> + {
> + LWLockRelease(InjectionPointLock);
> + elog(ERROR, "injection point \"%s\" already defined", name);
> + }
> +
> ...
> +#else
> + elog(ERROR, "Injection points are not supported by this build");
>
> Better to use similar formatting for error output, Injection vs
> injection (better not to capitalize the first letter for consistency
> pov)

Fixed.

> 5.
> + * Check first the shared hash table, and adapt the local cache
> + * depending on that as it could be possible that an entry to run
> + * has been removed.
> + */
>
> What if the entry is removed after we have released the
> InjectionPointLock? Or this would not cause any harm?

With an entry found in the shmem table?  I don't really think that we
need to care about such cases, TBH, because the injection point would
have been found in the table to start with.  This comes down to if we
should try to hold InjectionPointLock while calling the callback, and
that may not be a good idea in some cases if you'd expect a high
concurrency on the callback running.

> 0004:
>
> I think
> test_injection_points_wake() and test_injection_wait() can be moved as
> part of 0002

Nah.  I intend to keep the introduction of this API where it becomes
relevant.  Perhaps this could also use an isolation test?  This could
always be polished once we agree on 0001 and 0002.

(I'll post a v6 a bit later, there are more comments posted here and
there.)
--
Michael

Attachment
(Compiled two separate review emails into a single one)

On Tue, Jan 02, 2024 at 03:36:12PM +0530, Ashutosh Bapat wrote:
> This discussion has not been addressed in v6. I think the interface
> needs to be documented in the order below
> INJECTION_POINT - this declares an injection point - i.e. a place in
> code where an external code can be injected (and run).
> InjectionPointAttach() - this is used to associate ("attach") external
> code to an injection point.
> InjectionPointDetach() - this is used to disassociate ("detach")
> external code from an injection point.
>
> [arguments about doc organization]
>
> Even if an INJECTION_POINT is not "declared" attach would succeed but
> doesn't do anything. I think this needs to be made clear in the
> documentation. Better if we could somehow make Attach() fail if the
> specified injection point is not "declared" using INJECTION_POINT. Of
> course we don't want to bloat the hash table with all "declared"
> injection points even if they aren't being attached to and hence not
> used.

Okay, I can see your point.  I have reorganized the docs in the
following order:
- INJECTION_POINT
- Attach
- Detach

> I think, exposing the current injection point strings as
> #defines and encouraging users to use these macros instead of string
> literals will be a good start.

Nah, I disagree with this one actually.  It is easy to grep for the
macro INJECTION_POINT to be able to achieve the same research job, and
this would make the code more inconsistent when callbacks are run
within extensions which don't care about a #define in a backend
header.

> With the current implementation it's possible to "declare" injection
> point with same name at multiple places. It's useful but is it
> intended?

Yes.  I would recommend not doing that, but I don't see why there
would be a point in restricting that, either.

> /* Field sizes */
> #define INJ_NAME_MAXLEN 64
> #define INJ_LIB_MAXLEN 128
> #define INJ_FUNC_MAXLEN 128
> I think these limits should be either documented or specified in the
> error messages for users to fix their code in case of
> errors/unexpected behaviour.

Adding them to the error messages when attaching is a good idea.
Done.

> This is not an array entry anymore. I think we should rename
> InjectionPointEntry as SharedInjectionPointEntry and InjectionPointArrayEntry
> as LocalInjectionPointEntry.

Yep, fixed.

> +/* utilities to handle the local array cache */
> +static void
> +injection_point_cache_add(const char *name,
> + InjectionPointCallback callback)
> +{
> ... snip ...
> +
> + entry = (InjectionPointCacheEntry *)
> + hash_search(InjectionPointCache, name, HASH_ENTER, &found);
> +
> + if (!found)
>
> The function is called only when the injection point is not found in the local
> cache. So this condition will always be true. An Assert will help to make it
> clear and also prevent an unintended callback replacement.

Right, as coded that seems pointless to make the found conditional.  I
think that I coded it this way when doing some earlier work with this
code, and finished with a simpler thing.

> +#ifdef USE_INJECTION_POINTS
> +static bool
> +file_exists(const char *name)
>
> There's similar function in jit.c and dfmgr.c. Can we not reuse that code?

This has been mentioned in a different comment.  Refactored as of
0001, but there is something here related to EACCES for the JIT path.
Seems weird to me that we would not fail if the JIT library cannot be
accessed when stat() fails.

> + /* Save the entry */
> + memcpy(entry_by_name->name, name, sizeof(entry_by_name->name));
> + entry_by_name->name[INJ_NAME_MAXLEN - 1] = '\0';
> + memcpy(entry_by_name->library, library, sizeof(entry_by_name->library));
> + entry_by_name->library[INJ_LIB_MAXLEN - 1] = '\0';
> + memcpy(entry_by_name->function, function, sizeof(entry_by_name->function));
> + entry_by_name->function[INJ_FUNC_MAXLEN - 1] = '\0';
>
> Most of the code is using strncpy instead of memcpy. Why is this code different?

strncpy() is less used in the backend code.  It comes to a matter of
taste, IMO.

> + injection_callback = injection_point_cache_get(name);
> + if (injection_callback == NULL)
> + {
> + char path[MAXPGPATH];
> +
> + /* Found, so just run the callback registered */
>
> The condition indicates that the callback was not found. Comment looks wrong.

Fixed.

> Consider case
>
> Backend 2
> InjectionPointAttach("xyz", "abc", "pqr");
>
> Backend 1
> INJECTION_POINT("xyz");
>
> Backend 2
> InjectionPointDetach("xyz");
> InjectionPointAttach("xyz", "uvw", "lmn");
>
> Backend 1
> INJECTION_POINT("xyz");
>
> IIUC, the last INJECTION_POINT would run abc.pqr instead of uvw.lmn.
> Am I correct?

Yeah, that's an intended design choice to keep the code simpler and
faster as there is no need to track the library and function names in
the local caches or implement something similar to invalidation
messages for this facility because it would impact performance anyway
in the call paths.  In short, just don't do that, or use two distinct
points.

On Thu, Jan 04, 2024 at 06:22:35PM +0530, Ashutosh Bapat wrote:
> One more comment on 0001
> InjectionPointAttach() doesn't test whether the given function exists
> in the given library. Even if InjectionPointAttach() succeeds,
> INJECTION_POINT might throw error because the function doesn't exist.
> This can be seen as an unwanted behaviour. I think
> InjectionPointAttach() should test whether the function exists and
> possibly load it as well by adding it to the local cache.

This has the disadvantage of filling the local cache but that may not
be necessary with an extra load_external_function() in the attach
path.  I agree to make things safer, but I would do that when
attempting to run the callback instead.  Perhaps there's an argument
for the case of somebody replacing a library on-the-fly.  I don't
really buy it, but people like doing fancy things sometimes.

> 0002 comments
> --- /dev/null
> +++ b/src/test/modules/test_injection_points/expected/test_injection_points.out
>
> When built without injection point support, this test fails. We should
> add an alternate output file for such a build so that the behaviour
> with and without injection point support is tested. Or set up things
> such that the test is not run under make check in that directory. I
> will prefer the first option.

src/test/modules/Makefile has a safeguard for ./configure, and there's
one in test_injection_points/meson.build for Meson.  The test is not
run when the switches are not used, rather than using an alternate
output file.  There was a different issue when moving the tests to
src/test/recovery/, though, where we need to make the execution of the
tests conditional on get_option('injection_points').

> +
> +SELECT test_injection_points_run('TestInjectionError'); -- error
> +ERROR: error triggered for injection point TestInjectionError
> +-- Re-load and run again.
>
> What's getting Re-loaded here? \c will create a new connection and
> thus a new backend. Maybe the comment should say "test in a fresh
> backend" or something of that sort?

The local cache is reloaded.  Reworded.

> Looks confusing to me, we are testing the one removed as well. Am I
> missing something?
> [...]
> We aren't removing all entries TestInjectionLog2 is still there. Am I
> missing something?

Reworded all that.

> +# after recovery, the server will not start, and log PANIC: could not
> locate a valid checkpoint record
>
> IIUC the comment describes the behaviour with 7863ee4def65 reverted.
> But the test after this comment is written for the behaviour with
> 7863ee4def65. That's confusing. Is the intent to describe both
> behaviours in the comment?

This came from the original test case posted on the thread that
treated this bug.  There's more that bugs me for this script that I
would like to polish.  Let's focus on 0001 and 0002 for now..

> According to the prologue of ConditionVariableSleep(), that function
> should be called in a loop checking for the desired condition. All the
> callers that I examined follow that pattern. I think we need to follow
> that pattern here as well.
>
> Below comment from ConditionVariableTimedSleep() makes me think that
> the caller of ConditionVariableSleep() can be woken up even if the
> condition variable was not signaled. That's why the while() loop
> around ConditionVariableSleep().

That's the thing here, we don't have an extra condition to check
after.  The variable sleep is what triggers the stop.  :)
Perhaps this could be made smarter or with something else, I'm OK to
revisit that with the polishing for 0003 I'm planning.  We could use a
separate shared state, for example, but that does not improve the test
readability, either.

Attached is a v7 series.  What do you think?  0004 and 0005 for the
extra tests still need more discussion and much more polishing, IMO.
--
Michael

Attachment
On Tue, Jan 9, 2024 at 10:09 AM Michael Paquier <michael@paquier.xyz> wrote:

>
> Okay, I can see your point.  I have reorganized the docs in the
> following order:
> - INJECTION_POINT
> - Attach
> - Detach

Thanks. This looks better. Needs further wordsmithy. But that can wait
till the code has been reviewed.

>
> > I think, exposing the current injection point strings as
> > #defines and encouraging users to use these macros instead of string
> > literals will be a good start.
>
> Nah, I disagree with this one actually.  It is easy to grep for the
> macro INJECTION_POINT to be able to achieve the same research job, and
> this would make the code more inconsistent when callbacks are run
> within extensions which don't care about a #define in a backend
> header.

The macros should be in extension facing header, ofc. But I take back
this suggestion since defining these macros is extra work (every time
a new injection point is declared) and your suggestion to grep
practically works. We can improve things as needed.

>
> > With the current implementation it's possible to "declare" injection
> > point with same name at multiple places. It's useful but is it
> > intended?
>
> Yes.  I would recommend not doing that, but I don't see why there
> would be a point in restricting that, either.

Since an unintentional misspelling might trigger an unintended
injection point. But we will see how much of that happens in practice.

> > +#ifdef USE_INJECTION_POINTS
> > +static bool
> > +file_exists(const char *name)
> >
> > There's similar function in jit.c and dfmgr.c. Can we not reuse that code?
>
> This has been mentioned in a different comment.  Refactored as of
> 0001, but there is something here related to EACCES for the JIT path.
> Seems weird to me that we would not fail if the JIT library cannot be
> accessed when stat() fails.

I agree with this change to jit. Without having search permissions on
every directory in the path, the function can not determine if the
file exists or not. So throwing an error is better than just returning
false which means that
the file does not exist.

>
> > + /* Save the entry */
> > + memcpy(entry_by_name->name, name, sizeof(entry_by_name->name));
> > + entry_by_name->name[INJ_NAME_MAXLEN - 1] = '\0';
> > + memcpy(entry_by_name->library, library, sizeof(entry_by_name->library));
> > + entry_by_name->library[INJ_LIB_MAXLEN - 1] = '\0';
> > + memcpy(entry_by_name->function, function, sizeof(entry_by_name->function));
> > + entry_by_name->function[INJ_FUNC_MAXLEN - 1] = '\0';
> >
> > Most of the code is using strncpy instead of memcpy. Why is this code different?
>
> strncpy() is less used in the backend code.  It comes to a matter of
> taste, IMO.

To me using memcpy implies that the contents of the memory being
copied can be non-character. For a buffer containing a character
string I would prefer strncpy. But I wouldn't argue furher..

>
> Yeah, that's an intended design choice to keep the code simpler and
> faster as there is no need to track the library and function names in
> the local caches or implement something similar to invalidation
> messages for this facility because it would impact performance anyway
> in the call paths.  In short, just don't do that, or use two distinct
> points.

In practice the InjectionPointDetach() and InjectionPointAttach()
calls may not be close by and the user may not be able to figure out
why the injection points are behaviour weird. It may impact
performance but unexpected behaviour should be avoided, IMO.

If nothing else this should be documented.

>
> On Thu, Jan 04, 2024 at 06:22:35PM +0530, Ashutosh Bapat wrote:
> > One more comment on 0001
> > InjectionPointAttach() doesn't test whether the given function exists
> > in the given library. Even if InjectionPointAttach() succeeds,
> > INJECTION_POINT might throw error because the function doesn't exist.
> > This can be seen as an unwanted behaviour. I think
> > InjectionPointAttach() should test whether the function exists and
> > possibly load it as well by adding it to the local cache.
>
> This has the disadvantage of filling the local cache but that may not
> be necessary with an extra load_external_function() in the attach
> path.  I agree to make things safer, but I would do that when
> attempting to run the callback instead.  Perhaps there's an argument
> for the case of somebody replacing a library on-the-fly.  I don't
> really buy it, but people like doing fancy things sometimes.

I am ok with not populating the cache but checking with just
load_external_function(). This is again another ease of use scenario
where a silly mistake by user is caught earlier making user's life
easier. That at least should be the goal of the first cut.

>
> > 0002 comments
> > --- /dev/null
> > +++ b/src/test/modules/test_injection_points/expected/test_injection_points.out
> >
> > When built without injection point support, this test fails. We should
> > add an alternate output file for such a build so that the behaviour
> > with and without injection point support is tested. Or set up things
> > such that the test is not run under make check in that directory. I
> > will prefer the first option.
>
> src/test/modules/Makefile has a safeguard for ./configure, and there's
> one in test_injection_points/meson.build for Meson.  The test is not
> run when the switches are not used, rather than using an alternate
> output file.

With v6 I could run the test when built with enable_injection_point
false. I just ran make check in that folder. Didn't test meson build.

>  There was a different issue when moving the tests to
> src/test/recovery/, though, where we need to make the execution of the
> tests conditional on get_option('injection_points').

>
> > +
> > +SELECT test_injection_points_run('TestInjectionError'); -- error
> > +ERROR: error triggered for injection point TestInjectionError
> > +-- Re-load and run again.
> >
> > What's getting Re-loaded here? \c will create a new connection and
> > thus a new backend. Maybe the comment should say "test in a fresh
> > backend" or something of that sort?
>
> The local cache is reloaded.  Reworded.

We are starting a new backend not "re"loading a cache in an existing
backend per say.

>
> That's the thing here, we don't have an extra condition to check
> after.  The variable sleep is what triggers the stop.  :)
> Perhaps this could be made smarter or with something else, I'm OK to
> revisit that with the polishing for 0003 I'm planning.  We could use a
> separate shared state, for example, but that does not improve the test
> readability, either.

Yeah, I think we have to use another shared state. If the waiting
backend moves ahead without test_injection_point_wake() being called,
that could lead to unexpected and very weird behaviour.

It looks like ConditionVariable just remembers the processes that need
to be woken up during broadcast or signal. But by itself it doesn't
guarantee desired condition when woken up.

>
> Attached is a v7 series.  What do you think?  0004 and 0005 for the
> extra tests still need more discussion and much more polishing, IMO.

Generally I think the 0001 and 0002 are in good shape. However, I
would like them to be more easy to use - like catching simple user
errors that can be easily caught. That saves a lot of frustration
because of unexpected behaviour. I will review 0001 and 0002 from v7
in detail again, but it might take a few days.

--
Best Wishes,
Ashutosh Bapat



On Wed, Jan 10, 2024 at 03:21:03PM +0530, Ashutosh Bapat wrote:
> On Tue, Jan 9, 2024 at 10:09 AM Michael Paquier <michael@paquier.xyz> wrote:
>>> +#ifdef USE_INJECTION_POINTS
>>> +static bool
>>> +file_exists(const char *name)
>>>
>>> There's similar function in jit.c and dfmgr.c. Can we not reuse that code?
>>
>> This has been mentioned in a different comment.  Refactored as of
>> 0001, but there is something here related to EACCES for the JIT path.
>> Seems weird to me that we would not fail if the JIT library cannot be
>> accessed when stat() fails.
>
> I agree with this change to jit. Without having search permissions on
> every directory in the path, the function can not determine if the
> file exists or not. So throwing an error is better than just returning
> false which means that
> the file does not exist.

I was looking at the original set of threads related to JIT, and this
has been mentioned nowhere.  I think that I'm going to give it a shot
and see how the buildfarm reacts.  If that finishes with red, we could
always revert this part of the patch in jit.c still keep the
refactored routine.

>> Yeah, that's an intended design choice to keep the code simpler and
>> faster as there is no need to track the library and function names in
>> the local caches or implement something similar to invalidation
>> messages for this facility because it would impact performance anyway
>> in the call paths.  In short, just don't do that, or use two distinct
>> points.
>
> In practice the InjectionPointDetach() and InjectionPointAttach()
> calls may not be close by and the user may not be able to figure out
> why the injection points are behaviour weird. It may impact
> performance but unexpected behaviour should be avoided, IMO.
>
> If nothing else this should be documented.

In all the infrastructures I've looked at, folks did not really care
about having an invalidation for the callbacks loaded.  Still I'm OK
to add something in the documentation about that, say among the lines
of an extra sentence like:
"The callbacks loaded by a process are cached within each process.
There is no invalidation facility for the callbacks attached to
injection points, hence updating a callback for an injection point
requires a restart of the process to release its cache and the
previous callbacks attached to it."

> I am ok with not populating the cache but checking with just
> load_external_function(). This is again another ease of use scenario
> where a silly mistake by user is caught earlier making user's life
> easier. That at least should be the goal of the first cut.

I don't really aim for complicated here, just useful.

> With v6 I could run the test when built with enable_injection_point
> false. I just ran make check in that folder. Didn't test meson build.

The CI has been failing because 041_invalid_checkpoint_after_promote
was loading Time::HiRes::nanosleep and Windows does not support it.

> Yeah, I think we have to use another shared state. If the waiting
> backend moves ahead without test_injection_point_wake() being called,
> that could lead to unexpected and very weird behaviour.
>
> It looks like ConditionVariable just remembers the processes that need
> to be woken up during broadcast or signal. But by itself it doesn't
> guarantee desired condition when woken up.

Yeah, I'm not sure yet about how to do that in the most elegant way.
But this part could always happen after 0001~0003.

>> Attached is a v7 series.  What do you think?  0004 and 0005 for the
>> extra tests still need more discussion and much more polishing, IMO.
>
> Generally I think the 0001 and 0002 are in good shape. However, I
> would like them to be more easy to use - like catching simple user
> errors that can be easily caught. That saves a lot of frustration
> because of unexpected behaviour. I will review 0001 and 0002 from v7
> in detail again, but it might take a few days.

Thanks again for the reviews.  I still intend to focus solely on 0001,
0002 and 0003 for the current commit fest to have something able to
enforce error states in backends, at least.  There have been quite a
few bugs that could have coverage thanks for that.
--
Michael

Attachment
On Thu, Jan 11, 2024 at 9:42 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> >> Yeah, that's an intended design choice to keep the code simpler and
> >> faster as there is no need to track the library and function names in
> >> the local caches or implement something similar to invalidation
> >> messages for this facility because it would impact performance anyway
> >> in the call paths.  In short, just don't do that, or use two distinct
> >> points.
> >
> > In practice the InjectionPointDetach() and InjectionPointAttach()
> > calls may not be close by and the user may not be able to figure out
> > why the injection points are behaviour weird. It may impact
> > performance but unexpected behaviour should be avoided, IMO.
> >
> > If nothing else this should be documented.
>
> In all the infrastructures I've looked at, folks did not really care
> about having an invalidation for the callbacks loaded.  Still I'm OK
> to add something in the documentation about that, say among the lines
> of an extra sentence like:
> "The callbacks loaded by a process are cached within each process.
> There is no invalidation facility for the callbacks attached to
> injection points, hence updating a callback for an injection point
> requires a restart of the process to release its cache and the
> previous callbacks attached to it."

It doesn't behave exactly like that either. If the INJECTION_POINT is
run after detach (but before Attach), the local cache will be updated.
A subsequent attach and INJECTION_POINT call would fetch the new
callback.

>
> > I am ok with not populating the cache but checking with just
> > load_external_function(). This is again another ease of use scenario
> > where a silly mistake by user is caught earlier making user's life
> > easier. That at least should be the goal of the first cut.
>
> I don't really aim for complicated here, just useful.

It isn't complicated. Such simple error check improve user's
confidence on the feature and better be part of the 1st cut.

>
> > With v6 I could run the test when built with enable_injection_point
> > false. I just ran make check in that folder. Didn't test meson build.
>
> The CI has been failing because 041_invalid_checkpoint_after_promote
> was loading Time::HiRes::nanosleep and Windows does not support it.

Some miscommunication here. The SQL test under injection_point module
can be run in a build without injection_point and it fails. I think
it's better to have an alternate output for the same or prohibit the
test running itself.

--
Best Wishes,
Ashutosh Bapat



On Thu, Jan 11, 2024 at 02:47:27PM +0530, Ashutosh Bapat wrote:
> On Thu, Jan 11, 2024 at 9:42 AM Michael Paquier <michael@paquier.xyz> wrote:
>> I don't really aim for complicated here, just useful.
>
> It isn't complicated. Such simple error check improve user's
> confidence on the feature and better be part of the 1st cut.

I'm really not sure about that, because it does not impact the scope
of the facility even with all the use cases I've seen where injection
points could be used.  It could always be added later if there's a
strong push for it.  For testing, I'm biased about attempting to load
callbacks in the process attaching them.

>>> With v6 I could run the test when built with enable_injection_point
>>> false. I just ran make check in that folder. Didn't test meson build.
>>
>> The CI has been failing because 041_invalid_checkpoint_after_promote
>> was loading Time::HiRes::nanosleep and Windows does not support it.
>
> Some miscommunication here. The SQL test under injection_point module
> can be run in a build without injection_point and it fails. I think
> it's better to have an alternate output for the same or prohibit the
> test running itself.

The same problem exists if you try to run the SSL tests in
src/test/ssl/ without support build for them.  Protections at the
upper levels are good enough for the CI and the buildfarm, while
making the overall maintenance cheaper, so I'm happy with just these.

It also seems like you've missed this message, where this has been
mentioned (spoiler: first version of the patch used an alternate
output):
https://www.postgresql.org/message-id/ZUnuzPimkZCOVEcz@paquier.xyz
--
Michael

Attachment
On Fri, Jan 12, 2024 at 5:05 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Thu, Jan 11, 2024 at 02:47:27PM +0530, Ashutosh Bapat wrote:
> > On Thu, Jan 11, 2024 at 9:42 AM Michael Paquier <michael@paquier.xyz> wrote:
> >> I don't really aim for complicated here, just useful.
> >
> > It isn't complicated. Such simple error check improve user's
> > confidence on the feature and better be part of the 1st cut.
>
> I'm really not sure about that, because it does not impact the scope
> of the facility even with all the use cases I've seen where injection
> points could be used.  It could always be added later if there's a
> strong push for it.  For testing, I'm biased about attempting to load
> callbacks in the process attaching them.
>

I am not able to understand the objection to adding another handful of
lines of code. The core code is quite minimal and better to be robust.
We may seek someone else's opinion to break the tie.

> >>> With v6 I could run the test when built with enable_injection_point
> >>> false. I just ran make check in that folder. Didn't test meson build.
> >>
> >> The CI has been failing because 041_invalid_checkpoint_after_promote
> >> was loading Time::HiRes::nanosleep and Windows does not support it.
> >
> > Some miscommunication here. The SQL test under injection_point module
> > can be run in a build without injection_point and it fails. I think
> > it's better to have an alternate output for the same or prohibit the
> > test running itself.
>
> The same problem exists if you try to run the SSL tests in
> src/test/ssl/ without support build for them.  Protections at the
> upper levels are good enough for the CI and the buildfarm, while
> making the overall maintenance cheaper, so I'm happy with just these.
>
> It also seems like you've missed this message, where this has been
> mentioned (spoiler: first version of the patch used an alternate
> output):
> https://www.postgresql.org/message-id/ZUnuzPimkZCOVEcz@paquier.xyz

Ah! sorry for missing that. If there's a precedent, I am ok. If the
confusion arises we can fix it later.

--
Best Wishes,
Ashutosh Bapat



On Fri, Jan 12, 2024 at 08:35:42AM +0900, Michael Paquier wrote:
> It also seems like you've missed this message, where this has been
> mentioned (spoiler: first version of the patch used an alternate
> output):
> https://www.postgresql.org/message-id/ZUnuzPimkZCOVEcz@paquier.xyz

The refactoring of 0001 has now been applied as of e72a37528dda, and
the buildfarm looks stable (at least for now).

Here is a rebased patch set of the rest.
--
Michael

Attachment
Hi Michael,

There is some overlap between Dtrace functionality and this
functionality. But I see differences too. E.g. injection points offer
deeper integration whereas dtrace provides more information to the
probe like callstack and argument values etc. We need to assess
whether these functionality can co-exist and whether we need both of
them. If the answer to both of these questions is yes, it will be good
to add documentation explaining the differences and similarities and
also some guidance on when to use what.


On Fri, Jan 12, 2024 at 9:56 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Fri, Jan 12, 2024 at 08:35:42AM +0900, Michael Paquier wrote:
> > It also seems like you've missed this message, where this has been
> > mentioned (spoiler: first version of the patch used an alternate
> > output):
> > https://www.postgresql.org/message-id/ZUnuzPimkZCOVEcz@paquier.xyz
>
> The refactoring of 0001 has now been applied as of e72a37528dda, and
> the buildfarm looks stable (at least for now).
>
> Here is a rebased patch set of the rest.

+
+#ifdef USE_INJECTION_POINTS
+static bool
+file_exists(const char *name)
+{
+ struct stat st;
+
+ Assert(name != NULL);
+ if (stat(name, &st) == 0)
+ return !S_ISDIR(st.st_mode);
+ else if (!(errno == ENOENT || errno == ENOTDIR))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not access file \"%s\": %m", name)));
+ return false;
+}

Shouldn't this be removed now? The code should use one from fd.c

Other code changes look good. I think the documentation and comments
need some changes esp. considering the users point of view. Have
attached two patches (0003, and 0004) with those changes to be applied
on top of 0001 and 0002 respectively. Please review them. Might need
some wordsmithy and language correction. Attaching the whole patch set
to keep cibot happy.

This is review of 0001 and 0002 only. Once we take care of these
comments I think those patches will be ready for commit except one
point of contention mentioned in [1]. We haven't heard any third
opinion yet.

[1] https://www.postgresql.org/message-id/CAExHW5sc_ar7=W9XCcC9TwYxZF71Ghc6poQ_+u4HXTXmNB7KAw@mail.gmail.com


--
Best Wishes,
Ashutosh Bapat

Attachment
On Thu, Jan 18, 2024 at 10:56:09AM +0530, Ashutosh Bapat wrote:
> There is some overlap between Dtrace functionality and this
> functionality. But I see differences too. E.g. injection points offer
> deeper integration whereas dtrace provides more information to the
> probe like callstack and argument values etc. We need to assess
> whether these functionality can co-exist and whether we need both of
> them. If the answer to both of these questions is yes, it will be good
> to add documentation explaining the differences and similarities and
> also some guidance on when to use what.

Perhaps, I'm not sure how much we want to do regarding that yet,
injection points have no external dependencies and will work across
all environments as long as dlsym() (or an equivalent) is able to
work, while being cheaper because they don't spawn an external process
to trace the call.

> +
> +#ifdef USE_INJECTION_POINTS
> +static bool
> +file_exists(const char *name)
> +{
> + struct stat st;
> +
> + Assert(name != NULL);
> + if (stat(name, &st) == 0)
> + return !S_ISDIR(st.st_mode);
> + else if (!(errno == ENOENT || errno == ENOTDIR))
> + ereport(ERROR,
> + (errcode_for_file_access(),
> + errmsg("could not access file \"%s\": %m", name)));
> + return false;
> +}
>
> Shouldn't this be removed now? The code should use one from fd.c

Yep, removed that.

> Other code changes look good. I think the documentation and comments
> need some changes esp. considering the users point of view. Have
> attached two patches (0003, and 0004) with those changes to be applied
> on top of 0001 and 0002 respectively. Please review them. Might need
> some wordsmithy and language correction. Attaching the whole patch set
> to keep cibot happy.

The CF bot was perhaps happy but your 0004 has forgotten to update the
expected output.  There were also a few typos, some markups and edits
required for 0002 but as a whole what you have suggested was an
improvement.  Thanks.

> This is review of 0001 and 0002 only. Once we take care of these
> comments I think those patches will be ready for commit except one
> point of contention mentioned in [1]. We haven't heard any third
> opinion yet.

0001~0004 have been now applied, and I'm marking the CF entry as
committed.  I'll create a new thread once I have put more energy into
the regression test improvements.  Now the fun can really begin.  I am
also going to switch my buildfarm animals to use the new ./configure
switch.
--
Michael

Attachment
On Mon, Jan 22, 2024 at 10:08 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Thu, Jan 18, 2024 at 10:56:09AM +0530, Ashutosh Bapat wrote:
> > There is some overlap between Dtrace functionality and this
> > functionality. But I see differences too. E.g. injection points offer
> > deeper integration whereas dtrace provides more information to the
> > probe like callstack and argument values etc. We need to assess
> > whether these functionality can co-exist and whether we need both of
> > them. If the answer to both of these questions is yes, it will be good
> > to add documentation explaining the differences and similarities and
> > also some guidance on when to use what.
>
> Perhaps, I'm not sure how much we want to do regarding that yet,
> injection points have no external dependencies and will work across
> all environments as long as dlsym() (or an equivalent) is able to
> work, while being cheaper because they don't spawn an external process
> to trace the call.

Yes. Both have their advantages and disadvantages. So I believe both
will stay but that means the guidance is necessary. We may want to see
reception and add the guidance later in the release cycle.

>
> > Other code changes look good. I think the documentation and comments
> > need some changes esp. considering the users point of view. Have
> > attached two patches (0003, and 0004) with those changes to be applied
> > on top of 0001 and 0002 respectively. Please review them. Might need
> > some wordsmithy and language correction. Attaching the whole patch set
> > to keep cibot happy.
>
> The CF bot was perhaps happy but your 0004 has forgotten to update the
> expected output.  There were also a few typos, some markups and edits
> required for 0002 but as a whole what you have suggested was an
> improvement.  Thanks.

Sorry for that. Glad that you found those suggestions acceptable.

--
Best Wishes,
Ashutosh Bapat



On 22/01/2024 06:38, Michael Paquier wrote:
> 0001~0004 have been now applied, and I'm marking the CF entry as
> committed.

Woo-hoo!

I wrote the attached patch to enable injection points in the Cirrus CI 
config, to run the injection tests I wrote for a GIN bug today [1]. But 
that led to a crash in the asan-enabled build [2]. I didn't investigate 
it yet.

[1] 
https://www.postgresql.org/message-id/d8f0b068-0e6e-4b2c-8932-62507eb7e1c6%40iki.fi
[2] https://cirrus-ci.com/task/5242888636858368

-- 
Heikki Linnakangas
Neon (https://neon.tech)

Attachment
On 22/01/2024 18:08, Heikki Linnakangas wrote:
> I wrote the attached patch to enable injection points in the Cirrus CI
> config, to run the injection tests I wrote for a GIN bug today [1]. But
> that led to a crash in the asan-enabled build [2]. I didn't investigate
> it yet.

Pushed a fix for the crash.

What do you think of enabling this in the Cirrus CI config?

-- 
Heikki Linnakangas
Neon (https://neon.tech)




On Mon, Jan 22, 2024 at 09:02:48PM +0200, Heikki Linnakangas wrote:
> On 22/01/2024 18:08, Heikki Linnakangas wrote:
>> I wrote the attached patch to enable injection points in the Cirrus CI
>> config, to run the injection tests I wrote for a GIN bug today [1]. But
>> that led to a crash in the asan-enabled build [2]. I didn't investigate
>> it yet.
>
> Pushed a fix for the crash.

That's embarrassing.  Thanks for the quick fix.

> What do you think of enabling this in the Cirrus CI config?

That was on my TODO list of things to tackle and propose, but perhaps
there is no point in waiting more so I've applied your patch.
--
Michael

Attachment
On Tue, Jan 23, 2024 at 12:08:17PM +0900, Michael Paquier wrote:
> That was on my TODO list of things to tackle and propose, but perhaps
> there is no point in waiting more so I've applied your patch.

Slightly off topic and while I don't forget about it..  Please find
attached a copy of the patch posted around [1] to be able to define
injection points with input arguments, so as it is possible to execute
callbacks with values coming from the code path where the point is
attached.

For example, a backend could use this kind of macro to have a callback
attached to this point use some runtime value:
INJECTION_POINT_1ARG("InjectionPointBoo", &some_value);

[1]: https://www.postgresql.org/message-id/Za8TLyD9HIjzFlhJ%40paquier.xyz
--
Michael

Attachment