Re: On-demand running query plans using auto_explain and signals - Mailing list pgsql-hackers

From Shulgin, Oleksandr
Subject Re: On-demand running query plans using auto_explain and signals
Msg-id CACACo5TedzSJpdrZzjwpkw3i6a8PH2TdLvWzmpf2S719KQnwPQ@mail.gmail.com
In response to Re: On-demand running query plans using auto_explain and signals  ("Shulgin, Oleksandr" <oleksandr.shulgin@zalando.de>)
Responses Re: On-demand running query plans using auto_explain and signals  (Pavel Stehule <pavel.stehule@gmail.com>)
List pgsql-hackers
On Mon, Sep 14, 2015 at 3:09 PM, Shulgin, Oleksandr <oleksandr.shulgin@zalando.de> wrote:
On Mon, Sep 14, 2015 at 2:11 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

Now the backend that has been signaled on the second call to
pg_cmdstatus (it can be either some other backend, or the backend B
again) will not find an unprocessed slot, thus it will not try to
attach/detach the queue and the backend A will block forever.

This requires really bad timing, and the user should still be able to
interrupt the querying backend A.

I don't think we can rely on the low probability of this happening, and we should not rely on people interrupting the backend. It should be possible to detect the situation and fail gracefully.

It may be possible to introduce some lock-less protocol preventing such situations, but it's not there at the moment. If you believe it's possible, you need to explain and "prove" that it's actually safe.

Otherwise we may need to introduce some basic locking - for example we may introduce a LWLock for each slot, and lock it with dontWait=true (and skip it if we couldn't lock it). This should prevent most scenarios where one corrupted slot blocks many processes.

OK, I will revisit this part then.
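
For completeness, the per-slot LWLock fallback described above would look roughly like this in backend terms (just a sketch; the slot array and lock field are made-up names, not from the patch):

    /* Try to lock the slot, but never block on it. */
    if (LWLockConditionalAcquire(&slots[i].lock, LW_EXCLUSIVE))
    {
        /* ... inspect/update slots[i] ... */
        LWLockRelease(&slots[i].lock);
    }
    else
    {
        /* Slot is busy (possibly stuck): skip it rather than block. */
    }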

I have a radical proposal to remove the need for locking: make the CmdStatusSlot struct consist of a mere dsm_handle and move all the required metadata like sender_pid, request_type, etc. into the shared memory segment itself.
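
In rough terms, something like this (the names are only illustrative, not final):

    /* Per-backend slot in the shared array: nothing but the handle. */
    typedef struct
    {
        dsm_handle  handle;         /* 0 means "no outstanding request" */
    } CmdStatusSlot;

    /* Header written by the requesting backend at the start of the segment. */
    typedef struct
    {
        pid_t   sender_pid;         /* PID of the backend being queried */
        pid_t   requester_pid;      /* PID of the backend that asked */
        int     request_type;       /* e.g. plain plan text vs. full EXPLAIN */
        int     result_code;        /* filled in by the queried backend */
        /* the shm_mq carrying the payload follows in the same segment */
    } CmdStatusInfo;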

If we allow only the requesting process to update the slot (that is, the handle value itself), this removes the need for locking between sender and receiver.

The sender will walk through the slots looking for a non-zero dsm handle (according to the dsm_create() implementation, 0 is considered an invalid handle), and if it finds a valid one, it will attach and look inside, to check if it's destined for this process ID.  At first that might sound strange, but I would expect that 99% of the time the only valid slot will be the one for the process that has just been signaled.
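
From the signaled backend's side that walk could look like this (only a sketch; the slot array name and error handling are invented, but dsm_attach() returning NULL for an already-destroyed segment is existing behaviour):

    dsm_segment    *seg = NULL;
    CmdStatusInfo  *info = NULL;
    int             i;

    for (i = 0; i < MaxBackends; i++)
    {
        dsm_handle  h = slots[i].handle;

        if (h == 0)
            continue;               /* no request parked in this slot */

        seg = dsm_attach(h);
        if (seg == NULL)
            continue;               /* requester has already cleaned it up */

        info = (CmdStatusInfo *) dsm_segment_address(seg);
        if (info->sender_pid == MyProcPid)
            break;                  /* this request is addressed to us */

        dsm_detach(seg);            /* somebody else's request */
        seg = NULL;
    }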

The sender process will then calculate the response message, update the result_code in the shared memory segment and finally send the message through the queue.  If the receiver has since detached we get a detached result code and bail out.
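
Continuing the sketch, the answering side would then be along these lines (the shm_mq calls are the existing shared memory queue API; explain_current_query() and CMD_STATUS_RESULT_OK are placeholders, and the queue is assumed to sit right after the header in the segment):

    if (seg != NULL)
    {
        shm_mq         *mq;
        shm_mq_handle  *mqh;
        StringInfoData  payload;

        /* Build the plan/status text according to info->request_type. */
        initStringInfo(&payload);
        explain_current_query(&payload, info->request_type);   /* placeholder */

        info->result_code = CMD_STATUS_RESULT_OK;               /* placeholder */

        /* Attach to the queue the requester created after the header. */
        mq = (shm_mq *) ((char *) info + MAXALIGN(sizeof(CmdStatusInfo)));
        shm_mq_set_sender(mq, MyProc);
        mqh = shm_mq_attach(mq, seg, NULL);

        if (shm_mq_send(mqh, payload.len, payload.data, false) == SHM_MQ_DETACHED)
        {
            /* The requester went away in the meantime: just bail out. */
        }

        dsm_detach(seg);
    }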

Clearing the slot after receiving the message should be the requesting process' responsibility.  This way the receiver only writes to the slot and the sender only reads from it.
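
On the requesting side the whole life cycle would then be roughly as follows (again a sketch: SEGMENT_SIZE, MyCmdStatusSlot and the proc signal reason are placeholders, and the exact dsm_create() signature differs between server versions):

    dsm_segment    *seg;
    CmdStatusInfo  *info;
    shm_mq         *mq;
    shm_mq_handle  *mqh;
    Size            len;
    void           *data;

    /* target_pid / request_type come from the SQL-callable function's args. */

    /* Create the segment and fill in the request header. */
    seg = dsm_create(SEGMENT_SIZE, 0);
    info = (CmdStatusInfo *) dsm_segment_address(seg);
    info->sender_pid = target_pid;          /* the backend we want to query */
    info->requester_pid = MyProcPid;
    info->request_type = request_type;
    info->result_code = 0;

    mq = shm_mq_create((char *) info + MAXALIGN(sizeof(CmdStatusInfo)),
                       SEGMENT_SIZE - MAXALIGN(sizeof(CmdStatusInfo)));
    shm_mq_set_receiver(mq, MyProc);
    mqh = shm_mq_attach(mq, seg, NULL);

    /* Publish the handle; only the requesting backend ever writes its slot. */
    MyCmdStatusSlot->handle = dsm_segment_handle(seg);

    SendProcSignal(target_pid, PROCSIG_CMD_STATUS, InvalidBackendId);  /* placeholder reason */

    if (shm_mq_receive(mqh, &len, &data, false) == SHM_MQ_SUCCESS)
    {
        /* hand data/len (and info->result_code) back to the caller */
    }

    /* Clear our own slot and tear the segment down. */
    MyCmdStatusSlot->handle = 0;
    dsm_detach(seg);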

By the way, is it safe to assume atomic read/writes of dsm_handle (uint32)?  I would be surprised if not.

--
Alex
