Thread: fixing LISTEN/NOTIFY

fixing LISTEN/NOTIFY

From
Neil Conway
Date:
Applications that frequently use LISTEN/NOTIFY can suffer from
performance problems because of the MVCC bloat created by frequent
insertions into pg_listener. A solution to this has been suggested in
the past: rewrite LISTEN/NOTIFY to use shared memory rather than system
catalogs.

The problem is that there is a static amount of shared memory and a
potentially unbounded number of notifications, so we can run out of
memory. One way to solve this is to do as sinval does: when the queue
fills, clear it and effectively issue a NOTIFY ALL that awakens all
listeners. I don't like this behaviour: it seems ugly to
expose an implementation detail (static sizing of shared memory) to
applications. While a lot of applications are only using LISTEN/NOTIFY
for cache invalidation (and so spurious notifications are just a
performance hit), this behaviour still seems unfortunate to me. Using
NOTIFY ALL also makes NOTIFY 'msg' far less useful, which is a feature
several users have asked for in the past.

I think it would be better to either fail the NOTIFY when there is not
enough shared memory to add a new notification to the queue, or have the
NOTIFY block until shared memory does become available (applications
could of course implement the latter on top of the former by using
savepoints and a loop, either on the client-side or in PL/PgSQL). I
guess we could add an option to NOTIFY to specify how to handle
failures.

A related question is when to add the notification to the shared memory
queue. We don't want the notification to fire until the NOTIFY's
transaction commits, so one alternative would be to delay appending to
the queue until transaction-commit time. However, that would mean we
wouldn't notice NOTIFY failure until the end of the transaction, or else
that we would block waiting for free space during the transaction-commit
process. I think it would be better to add an entry to shared memory
during the NOTIFY itself, and stamp that entry with the NOTIFY's
toplevel XID. Other backends can read the notification immediately
(and once all the backends have seen it, the notification can be removed
from the queue). Each backend can use the XID to determine when to
"fire" the notification (and if the notifying backend rolls back, they
can just discard the notification). This scheme is more expensive when
the notifying transaction rolls back, but I don't think that is the
common case.
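In sketch form, the per-entry decision each listening backend would make
looks something like this (all names here are illustrative, not actual
PostgreSQL structures):

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Hypothetical layout of a shared-queue entry, written at NOTIFY time
 * and stamped with the notifying backend's toplevel XID. */
typedef struct {
    TransactionId xid;      /* toplevel XID of the notifying transaction */
    char topic[64];         /* NAMEDATALEN-sized in the real system */
    int  unread_backends;   /* entry can be reclaimed when this reaches 0 */
} NotifyEntry;

typedef enum { XACT_IN_PROGRESS, XACT_COMMITTED, XACT_ABORTED } XactStatus;

/* What a listener does with an entry it has copied out of the queue:
 * fire it, discard it, or wait for the XID's fate to be decided. */
typedef enum { FIRE, DISCARD, WAIT } Decision;

Decision decide(XactStatus status_of_xid)
{
    switch (status_of_xid) {
    case XACT_COMMITTED: return FIRE;     /* notifier committed: deliver */
    case XACT_ABORTED:   return DISCARD;  /* notifier rolled back */
    default:             return WAIT;     /* in progress: re-check later */
    }
}
```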

Comments? (I'm still thinking about how to organize the shared memory
queue, and whether any of the sinval stuff can be reused...)

-Neil




Re: fixing LISTEN/NOTIFY

From
Tom Lane
Date:
Neil Conway <neilc@samurai.com> writes:
> [ various ideas about reimplementing LISTEN/NOTIFY ]

I really dislike the idea of pushing notifies out to the shared queue
before commit.  That essentially turns "forever do notify foo" into
a global DOS tool: you can drive everybody else's backend into swap
hell along with your own.

The idea of blocking during commit until shmem becomes available might
work.  There's some issues here about transaction atomicity, though:
how do you guarantee that all or none of your notifies get sent?
(Actually, supposing that the notifies ought to be sent post-commit,
"all" is the only acceptable answer.  So maybe you just never give up.)
        regards, tom lane


Re: fixing LISTEN/NOTIFY

From
Neil Conway
Date:
On Thu, 2005-06-10 at 01:14 -0400, Tom Lane wrote:
> The idea of blocking during commit until shmem becomes available might
> work.  There's some issues here about transaction atomicity, though:
> how do you guarantee that all or none of your notifies get sent?
> (Actually, supposing that the notifies ought to be sent post-commit,
> "all" is the only acceptable answer.  So maybe you just never give up.)

Yeah, I think that would work. We could also write to shared memory
before the commit proper, and embed an XID in the message to allow other
backends to determine when/if to fire the notification.

However, I don't really like the idea of blocking the backend for a
potentially significant amount of time in a state half-way between
"committed" and "ready for the next query".

-Neil




Re: fixing LISTEN/NOTIFY

From
Alvaro Herrera
Date:
On Thu, Oct 06, 2005 at 01:32:32AM -0400, Neil Conway wrote:
> On Thu, 2005-06-10 at 01:14 -0400, Tom Lane wrote:
> > The idea of blocking during commit until shmem becomes available might
> > work.  There's some issues here about transaction atomicity, though:
> > how do you guarantee that all or none of your notifies get sent?
> > (Actually, supposing that the notifies ought to be sent post-commit,
> > "all" is the only acceptable answer.  So maybe you just never give up.)
> 
> Yeah, I think that would work. We could also write to shared memory
> before the commit proper, and embed an XID in the message to allow other
> backends to determine when/if to fire the notification.
> 
> However, I don't really like the idea of blocking the backend for a
> potentially significant amount of time in a state half-way between
> "committed" and "ready for the next query".

I don't like the idea of blocking indefinitely.  It means another global
DOS tool for anybody trying to NOTIFY: just do a LISTEN and sit there
doing nothing.

One idea would be to block for a while, with a timeout.  If it expires,
the receiving backend(s) have to copy the notification to local memory
and let go of the one in shared memory.
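That could look roughly like this (a toy queue with made-up names, not
real PostgreSQL code): the sender retries for a bounded time, and
receivers drain entries into backend-local memory to free shared slots.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define QUEUE_SLOTS 2   /* tiny on purpose, to show the full case */

typedef struct { char topic[64]; bool used; } Slot;
typedef struct { Slot slots[QUEUE_SLOTS]; } Queue;

static bool queue_try_put(Queue *q, const char *topic)
{
    for (int i = 0; i < QUEUE_SLOTS; i++) {
        if (!q->slots[i].used) {
            strncpy(q->slots[i].topic, topic, sizeof q->slots[i].topic - 1);
            q->slots[i].used = true;
            return true;
        }
    }
    return false;   /* shared memory full */
}

/* Receiver side: under timeout pressure, copy entries to backend-local
 * memory so the shared slots can be reused. Returns entries drained. */
static int drain_to_local(Queue *q, char local[][64], int max)
{
    int n = 0;
    for (int i = 0; i < QUEUE_SLOTS && n < max; i++) {
        if (q->slots[i].used) {
            strcpy(local[n++], q->slots[i].topic);
            q->slots[i].used = false;
        }
    }
    return n;
}

/* Sender side: retry for a bounded number of "ticks" instead of
 * blocking forever; on expiry the caller must handle the failure. */
static bool put_with_timeout(Queue *q, const char *topic, int max_ticks)
{
    for (int tick = 0; tick < max_ticks; tick++)
        if (queue_try_put(q, topic))
            return true;
    return false;
}
```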

-- 
Alvaro Herrera                        http://www.advogato.org/person/alvherre
"I'm always right, but sometimes I'm more right than other times."
(Linus Torvalds)


Re: fixing LISTEN/NOTIFY

From
Tom Lane
Date:
Neil Conway <neilc@samurai.com> writes:
> However, I don't really like the idea of blocking the backend for a
> potentially significant amount of time in a state half-way between
> "committed" and "ready for the next query".

I wonder whether we could use something comparable to pg_multixact
or pg_subtrans, to convert the problem from one of "need to fit in
fixed amount of memory" to one of "it's on disk with some buffers
in memory".
        regards, tom lane


Re: fixing LISTEN/NOTIFY

From
Alvaro Herrera
Date:
On Thu, Oct 06, 2005 at 09:12:58AM -0400, Tom Lane wrote:
> Neil Conway <neilc@samurai.com> writes:
> > However, I don't really like the idea of blocking the backend for a
> > potentially significant amount of time in a state half-way between
> > "committed" and "ready for the next query".
> 
> I wonder whether we could use something comparable to pg_multixact
> or pg_subtrans, to convert the problem from one of "need to fit in
> fixed amount of memory" to one of "it's on disk with some buffers
> in memory".

The multixact mechanism seems a perfect fit to me (variable length
contents, identifiers produced serially and destroyed in a
not-too-distant future).  In fact I proposed it awhile back.

-- 
Alvaro Herrera                 http://www.amazon.com/gp/registry/CTMLCN8V17R4
"At least to kernel hackers, who really are human, despite occasional
rumors to the contrary" (LWN.net)


Re: fixing LISTEN/NOTIFY

From
Heikki Linnakangas
Date:
First of all, I'd like to give a name to the thing that a backend listens 
to. The code talks about "listening to a relation", but as the 
comments there point out, the name doesn't actually identify a relation. 
I'm going to call it a topic from now on.

I'd like to point out that we don't really have a notification queue, 
but a notification flag for each topic. We're not constrained by the 
number of notifications, but by the number of topics.

It might make sense to change the semantics so that we never lose a 
notification, if we're going to implement NOTIFY 'msg', but that's another 
discussion.

I've been thinking about the options for shmem data structure. Replacing 
pg_listener in the straightforward way would give us an array:

struct {
    char topicname[NAMEDATALEN];
    int  listenerpid;
    int  notification;
} listenertable[max_subscriptions];

Where max_subscriptions is the maximum number of active LISTENs.

If we're ready to take the performance hit, we can exploit the fact that
it's harmless to signal extra backends, as long as those backends can
figure out that it was a false alarm. In fact, we can always signal all
backends.

Exploiting that, we could have:

struct {
    char topicname[NAMEDATALEN];
    int  subscriptions;         /* number of active subscriptions for this topic */
    int  notification_counter;  /* increase by 1 on each NOTIFY */
} listenertable[max_topics];

NOTIFY increases the notification_counter by one, and signals all 
backends. Every backend keeps a private list of 
(topicname, notification_counter) pairs for the topics it's subscribed 
to, in addition to the shared memory table. The signal handler compares 
the notification_counter in the private list and in the listenertable. If 
they don't match, notify the client. If they match, it was a false alarm.

The shmem requirement of this is
    max_topics * (NAMEDATALEN + sizeof(int) * 2)

If we're not ready to take the performance hit, we can add the list of 
backends to shmem:

struct {
    char topicname[NAMEDATALEN];
    int  subscriptions;         /* number of active subscriptions for this topic */
    int  notification_counter;  /* increase by 1 on each NOTIFY */
    int  subscribers[max_backends];
} listenertable[max_topics];

and only signal those backends that are in the subscribers array.

The shmem requirement of this is
    max_topics * (NAMEDATALEN + sizeof(int) * (2 + max_backends))


We can also do a tradeoff between shmem usage and unnecessary signals:

struct {
    char topicname[NAMEDATALEN];
    int  subscriptions;         /* number of active subscriptions for this topic */
    int  notification_counter;  /* increase by 1 on each NOTIFY */
    int  subscriber_cache[cache_size];
} listenertable[max_topics];

The shmem requirement of this is
    max_topics * (NAMEDATALEN + sizeof(int) * (2 + cache_size))

Where cache_size can be anything between 0 and max_backends. If the cache 
gets full, NOTIFY signals all backends. Otherwise, only those that are in 
the cache.
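The cache bookkeeping could be sketched like this (made-up names, not
real code): LISTEN records subscribers until the cache overflows, after
which NOTIFY falls back to signalling everyone.

```c
#include <assert.h>

#define CACHE_SIZE 4    /* stands in for the configurable cache_size */

typedef struct {
    int pids[CACHE_SIZE];   /* cached subscriber PIDs */
    int n;                  /* entries cached, or -1 once overflowed */
} SubscriberCache;

/* LISTEN side: remember a subscriber if there is room. */
void cache_subscriber(SubscriberCache *c, int pid)
{
    if (c->n < 0)
        return;             /* already overflowed: nothing to track */
    if (c->n == CACHE_SIZE) {
        c->n = -1;          /* cache full: NOTIFY must signal everyone */
        return;
    }
    c->pids[c->n++] = pid;
}

/* NOTIFY side: how many backends to signal; -1 means "all of them". */
int backends_to_signal(const SubscriberCache *c)
{
    return c->n;
}
```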


If max_topics is large, a hash table should be used instead of an array of 
structs.

Now that I think of it, using the notification_counter, we *can* guarantee 
that no notification is lost. The signal handler just needs to notify the 
client (shmem notification_counter) - (private notification_counter) 
times.
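In sketch form (illustrative names only), the counter comparison and the
no-lost-notifications property look like this:

```c
#include <assert.h>

typedef struct {
    int notification_counter;   /* shared-memory counter for one topic */
} TopicSlot;

typedef struct {
    int seen_counter;           /* backend-private copy for that topic */
} PrivateSub;

/* Signal-handler path: returns how many notifications to deliver to
 * the client; 0 means this wakeup was a false alarm for the topic. */
int pending_notifications(const TopicSlot *shared, PrivateSub *priv)
{
    int n = shared->notification_counter - priv->seen_counter;
    priv->seen_counter = shared->notification_counter;
    return n;
}
```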


- Heikki



Re: fixing LISTEN/NOTIFY

From
Tom Lane
Date:
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> It might make sense to change the semantics so that we never lose a 
> notification, if we're going to implement NOTIFY 'msg', but that's another 
> discussion.

That's pretty much a given --- the ability to pass some payload data in
notifications has been on the TODO list for a very long time.  I don't
think we're going to reimplement listen/notify without adding it.

I like your suggestion of "topic" for the notify name, and am tempted to
go fix the documentation to use that term right now ...
        regards, tom lane


Re: fixing LISTEN/NOTIFY

From
Greg Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> I like your suggestion of "topic" for the notify name, and am tempted to
> go fix the documentation to use that term right now ...

Fwiw, I think the more conventional word here would be "channel". 
But whatever.

-- 
greg



Re: fixing LISTEN/NOTIFY

From
"Jim C. Nasby"
Date:
On Thu, Oct 06, 2005 at 10:30:24AM -0400, Tom Lane wrote:
> Heikki Linnakangas <hlinnaka@iki.fi> writes:
> > It might make sense to change the semantics so that we never lose a 
> > notification, if we're going to implement NOTIFY 'msg', but that's another 
> > discussion.
> 
> That's pretty much a given --- the ability to pass some payload data in
> notifications has been on the TODO list for a very long time.  I don't
> think we're going to reimplement listen/notify without adding it.

Maybe I'm missing something, but is it possible to ensure notifications
aren't lost using Heikki's method, since everything's only in shared
memory? Or is the idea that stuff would not survive a backend crash?
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: fixing LISTEN/NOTIFY

From
Tom Lane
Date:
"Jim C. Nasby" <jnasby@pervasive.com> writes:
> Maybe I'm missing something, but is it possible to ensure notifications
> aren't lost using Heikki's method, since everything's only in shared
> memory? Or is the idea that stuff would not survive a backend crash?

Listen/notify state has never survived a crash (since it is defined in
terms of PIDs that will no longer exist after a DB restart), and I don't
really see any reason why we'd want it to.  An application reconnecting
after a DB crash would have to assume it might have missed some
notifications occurring before it could reconnect, and would have to
re-determine what the database state is anyway.

But I think you might be confusing that with the feature-or-bug
(depending on one's point of view) that duplicate notifications can be
merged into one event.  I'm inclined to preserve that behavior,
primarily because not doing so would create a tremendous penalty on
applications that expect it to work that way.  With addition of payload
data it'd be easy for apps that don't want merging to prevent it: just
add an otherwise-uninteresting serial number to the payload string.
We'd certainly want to define the "duplicate" test to consider the
payload string as well as the topic name.
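In sketch form, that duplicate test would be (illustrative struct, not
actual code):

```c
#include <assert.h>
#include <string.h>

/* Two notifications merge only if both topic and payload match. */
typedef struct {
    const char *topic;
    const char *payload;
} Notification;

int is_duplicate(const Notification *a, const Notification *b)
{
    return strcmp(a->topic, b->topic) == 0
        && strcmp(a->payload, b->payload) == 0;
}
```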
        regards, tom lane


Re: fixing LISTEN/NOTIFY

From
"Jim C. Nasby"
Date:
On Sat, Oct 08, 2005 at 04:59:22PM -0400, Tom Lane wrote:
> "Jim C. Nasby" <jnasby@pervasive.com> writes:
> > Maybe I'm missing something, but is it possible to ensure notifications
> > aren't lost using Heikki's method, since everything's only in shared
> > memory? Or is the idea that stuff would not survive a backend crash?
> 
> Listen/notify state has never survived a crash (since it is defined in
> terms of PIDs that will no longer exist after a DB restart), and I don't
> really see any reason why we'd want it to.  An application reconnecting
> after a DB crash would have to assume it might have missed some
> notifications occurring before it could reconnect, and would have to
> re-determine what the database state is anyway.
> 
> But I think you might be confusing that with the feature-or-bug
> (depending on one's point of view) that duplicate notifications can be
> merged into one event.  I'm inclined to preserve that behavior,
> primarily because not doing so would create a tremendous penalty on
> applications that expect it to work that way.  With addition of payload
> data it'd be easy for apps that don't want merging to prevent it: just
> add an otherwise-uninteresting serial number to the payload string.
> We'd certainly want to define the "duplicate" test to consider the
> payload string as well as the topic name.

I thought the idea behind NOTIFY 'msg' was to enable some form of
queuing, or at least something that ensures the notification sticks
around until picked up. I see I was wrong. :)

Before ripping out the old code, would it be useful as the basis for a
queue/notification system that ensures a message sticks around until at
least one listener picks it up? Granted, I think such a thing could be
built using the new notification system along with a table, but if the
code's already there and useful...
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: fixing LISTEN/NOTIFY

From
"Neil Conway"
Date:
Tom Lane said:
> But I think you might be confusing that with the feature-or-bug
> (depending on one's point of view) that duplicate notifications can be
> merged into one event.  I'm inclined to preserve that behavior,
> primarily because not doing so would create a tremendous penalty on
> applications that expect it to work that way.

What sort of application are you envisioning?

If you mean there are applications for which avoiding duplicate
notifications is a performance win, I think those applications are on
shaky ground: the LISTEN/NOTIFY interface doesn't guarantee that no
duplicate notifications will be delivered (it merely doesn't guarantee
they *will* be delivered).

Personally, I think delivering all notifications by default is simpler
behavior for the application programmer to understand. If you want to
avoid doing work for duplicate notifications, you realistically need to
implement that yourself anyway.

-Neil




Re: fixing LISTEN/NOTIFY

From
Josh Berkus
Date:
Neil, Jim, All:

> Personally, I think delivering all notifications by default is simpler
> behavior for the application programmer to understand. If you want to
> avoid doing work for duplicate notifications, you realistically need to
> implement that yourself anyway.

I agree with Neil; I don't really see duplicate notifications as a problem, 
provided that we can deliver a message with the notice.  As long as we can 
deliver a PK id or other unique information, then we can still use NOTIFY to 
do, for example, asynchronous summary counts.

However, I really dislike the idea of blocking on out-of-memory NOTIFYs.  
While I can see some DBAs wanting that, for most applications I'd want to 
implement, I'd rather that the notification were thrown away than have it 
stop a query.  LISTEN/NOTIFY is an asynchronous messaging mechanism, after 
all -- the whole idea is for it *not* to slow down transactions.

Ideally, we'd have three options which could be set administratively (at 
startup) in postgresql.conf:
notify_mem = 16384        # in K
notify_mem_full = discard # one of: discard, block, or disk
                          # discard = simply drop notices over the memory limit
                          # block = stop queries from committing until notice memory is open
                          # disk = spill notices to the pg_notice file

You'll notice that I added a spill-to-disk on NOTIFY; I think that should be a 
TODO even if we don't implement it immediately.

We should also have a way to detect that notify_mem is full, both through 
log notices and via a callable function.   Finally, we'd need a function 
for the superuser to clear all notices.

I'm thinking here of my personal application for a better LISTEN/NOTIFY: using 
pg_memcached to store fast asynchronous materialized views.  The idea would 
be that you could maintain a materialized view in memory that was no more 
than 30 seconds behind, by:
[every 15 seconds]:
    check if there are any notices on the view
    if so, regenerate the rows indicated in the notify message
    if not, next
    check if notify_mem is full
    if so, regenerate all mattviews from scratch
    send clear_notices();
Hopefully people get the idea?

-- 
Josh Berkus
Aglio Database Solutions
San Francisco


Re: fixing LISTEN/NOTIFY

From
Tom Lane
Date:
"Neil Conway" <neilc@samurai.com> writes:
> Tom Lane said:
>> I'm inclined to preserve that behavior,
>> primarily because not doing so would create a tremendous penalty on
>> applications that expect it to work that way.

> What sort of application are you envisioning?

The ones that have a per-row trigger that does "NOTIFY foo".  In the
past this would deliver one event per transaction; changing that to one
per row is going to kill them.

I'm not very concerned about whether similar events issued by different
transactions are merged or not --- as you say, one could never rely on
that to happen anyway because of timing.  But one event per transaction
has been a reliable behavior and I think it would be bad to change it.
        regards, tom lane