Thread: Listen / Notify rewrite

Listen / Notify rewrite

From
Joachim Wieland
Date:
Hi,

Attached is a patch for a new listen/notify implementation.

In a few words, the patch reimplements listen/notify as an slru-based queue
which works similar to the sinval structure. Essentially it is a ring buffer on
disk with pages mapped into shared memory for read/write access.

Additionally the patch does the following (see below for details):

 1. It removes the pg_listener relation and
 2. adds the possibility to specify a payload parameter, i.e. executing in SQL
    "NOTIFY foo 'payload';" and 'payload' will be delivered to any listening
    backend.
 3. Every distinct notification is delivered.
 4. Order is preserved, i.e. if txn 1 first does NOTIFY foo, then NOTIFY bar, a
    backend (listening to both "foo" and "bar") will always first receive the
    notification "foo" and then the notification "bar".
 5. It's now "listen to a channel", not "listen to a relation" anymore...


Details:

1. Instead of placing the queue into shared memory only I propose to create a
new subdirectory pg_notify/ and make the queue slru-based, such that we do not
risk blocking. Several people here have pointed out that blocking is a true
no-go for a new listen/notify implementation. With an slru-based queue we have
so much space that blocking is really unlikely even in periods with extreme
notify bursts.
Regarding performance, the slru-queue is not fsync-ed to disk so most activity
would be in the OS file cache memory anyway and most backends will probably
work on the same pages most of the time. However more locking overhead is
required in comparison to a shared-memory-only implementation.

There is one doubt that I have: Currently the patch adds notifications to the
queue after the transaction has committed to clog. The advantage is that we do
not need to take care of visibility. When we add notifications to the queue, we
have committed our transaction already and all reading backends are not in a
transaction anyway, so everything is visible to everyone and we can just write
to and read from the queue.

However, if for some reason we cannot write to the slru files in the pg_notify/
directory we might want to roll back the current transaction but with the
proposed patch we cannot because we have already committed...
But... if there is a problem with the pg_notify/ directory, then something is
fundamentally wrong on the file system of the database server and pg_subtrans/
and pg_clog/ are probably affected by the same problem... One possible solution
would be to write to the queue before committing and adding the TransactionID.
Then other backends can check if our TransactionID has successfully committed
or not. Not sure if this is worth the overhead however...

2. The payload parameter is optional. A notifying client can either call
"NOTIFY foo;" or "NOTIFY foo 'payload';". The length of the payload is
currently limited to 128 characters... Not sure if we should allow longer
payload strings... If there is more complex data to be transferred, the sending
transaction can always put all of that data into a relation and just send the
id of the entry. If no payload is specified, then this is treated internally
like an empty payload. Consequently an empty string will be delivered as the
payload to the listening backend.

3. Not every notification is delivered, only distinct notifications (per
transaction).
In other words, each sending transaction eliminates duplicate notifications
that have the exact same payload and channel name as another notification that
is already in the queue to be sent out.

Should we have an upper limit on the number of notifications that a transaction
is allowed to send? Without an upper limit, a client can open a transaction and
send a series of NOTIFYs, each with a different payload until its backend runs
out of memory...

4.
Sending Transaction does:     This will be delivered (on commit):

NOTIFY foo;                       NOTIFY foo;
NOTIFY foo;                       NOTIFY bar 'value1';
NOTIFY bar 'value1';              NOTIFY bar 'value2';
NOTIFY foo;
NOTIFY bar 'value2';
NOTIFY bar 'value1';

Note that we do _not_ guarantee that notifications from transaction 1 are
always delivered before the notifications of transaction 2 just because
transaction 1 committed before transaction 2.


Let me know what you think,

Regards,
Joachim

Attachment

Re: Listen / Notify rewrite

From
Martijn van Oosterhout
Date:
On Wed, Nov 11, 2009 at 10:25:05PM +0100, Joachim Wieland wrote:
> Hi,
>
> Attached is a patch for a new listen/notify implementation.
>
> In a few words, the patch reimplements listen/notify as an slru-based queue
> which works similar to the sinval structure. Essentially it is a ring buffer on
> disk with pages mapped into shared memory for read/write access.

While I can't really comment on the implementation, from your
description it looks like a big improvement.

Nice work!

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.

Re: Listen / Notify rewrite

From
"A.M."
Date:
On Nov 11, 2009, at 4:25 PM, Joachim Wieland wrote:

> Hi,
>
> Attached is a patch for a new listen/notify implementation.
>
> In a few words, the patch reimplements listen/notify as an slru- 
> based queue
> which works similar to the sinval structure. Essentially it is a  
> ring buffer on
> disk with pages mapped into shared memory for read/write access.
>
> Additionally the patch does the following (see below for details):
>
> 1. It removes the pg_listener relation and
> 2. adds the possibility to specify a payload parameter, i.e.  
> executing in SQL
>    "NOTIFY foo 'payload';" and 'payload' will be delivered to any  
> listening
>    backend.
> 3. Every distinct notification is delivered.
> 4. Order is preserved, i.e. if txn 1 first does NOTIFY foo, then  
> NOTIFY bar, a
>    backend (listening to both "foo" and "bar") will always first  
> receive the
>    notification "foo" and then the notification "bar".
> 5. It's now "listen to a channel", not "listen to a relation"  
> anymore...

Hi Joachim,

Thank you for implementing this- LISTEN/NOTIFY without a payload has  
been a major problem to work around for me.

I understand that coalescing multiple notifications of the same name  
happens now, but I never understood why. I had hoped that someone  
revisiting this "feature" would remove it. My use case is autonomous  
transactions.
From a developer's standpoint, NOTIFY specifies a form of remote  
trigger for further action- outside the transaction- to occur. From  
that point of view, I don't see why NOTIFY foo; NOTIFY foo; is  
equivalent to NOTIFY foo;. I understand the use case where a per-row  
trigger could generate lots of spurious notifications, but it seems  
that if anything is generating spurious notifications, the  
notification is in the wrong place.

The documentation makes the strong implication that notification names  
are usually table names, but with the new payload, this becomes even  
less the case.

""I changed this table, take a look at it to see what's new". But no  
such association is enforced by the NOTIFY and LISTEN commands."
http://www.postgresql.org/docs/8.4/static/sql-notify.html

With the coalescing, I feel like I am being punished for using NOTIFY  
with something other than a table name use case.

At least with this new payload, I can set the payload to the  
transaction ID and be certain that all the notifications I sent are  
processed (and in order even!) but could you explain why the  
coalescing is still necessary?

While coalescing could be achieved on the receiving-end NOTIFY  
callback ( if(payload ID was already processed) continue; ), non- 
coalescing behavior cannot be achieved when the backend does the  
coalescing.

For backwards compatibility, payload-less NOTIFY could coalesce while  
NOTIFYs with a payload would not coalesce, but, on the other hand,  
that might be confusing.

In any case, thank you for improving this long-neglected subsystem!

Best regards,
M


Re: Listen / Notify rewrite

From
Andrew Chernow
Date:
>> 2. adds the possibility to specify a payload parameter, i.e. executing 
>> in SQL
>>    "NOTIFY foo 'payload';" and 'payload' will be delivered to any 
>> listening
>>    backend.
> 
> Thank you for implementing this- LISTEN/NOTIFY without a payload has 
> been a major problem to work around for me.

+1

-- 
Andrew Chernow
eSilo, LLC
every bit counts
http://www.esilo.com/


Re: Listen / Notify rewrite

From
Andrew Chernow
Date:
   /*
+  * This function is executed for every notification found in the queue in order
+  * to check if the current backend is listening on that channel. Not sure if we
+  * should further optimize this, for example convert to a sorted array and
+  * allow binary search on it...
+  */
+ static bool
+ IsListeningOn(const char *channel)


I think a bsearch would be needed.  Very busy servers that make heavy use of 
notifies would be quite a penalty.

-- 
Andrew Chernow
eSilo, LLC
every bit counts
http://www.esilo.com/


Re: Listen / Notify rewrite

From
Joachim Wieland
Date:
On Thu, Nov 12, 2009 at 1:04 AM, Andrew Chernow <ac@esilo.com> wrote:
> I think a bsearch would be needed.  Very busy servers that make heavy use of
> notifies would be quite a penalty.

In such an environment, how many relations/channels is a backend
typically listening to?
Do you have average / maximal numbers?


Regards,
Joachim


Re: Listen / Notify rewrite

From
Andrew Chernow
Date:
Joachim Wieland wrote:
> On Thu, Nov 12, 2009 at 1:04 AM, Andrew Chernow <ac@esilo.com> wrote:
>> I think a bsearch would be needed.  Very busy servers that make heavy use of
>> notifies would be quite a penalty.
> 
> In such an environment, how many relations/channels is a backend
> typically listening to?
> Do you have average / maximal numbers?
> 

We have a system where many libpq clients, ~2000 - 4000 per server (depends on 
hardware), maintain a persistent backend connection.  Each connection listens 
for notifies, LISTEN 'client_xxx'.  There are maybe 10 different reasons why a 
NOTIFY 'client_xxx' is fired.  Sometimes, notifies are broadcasted to all client 
connections, or just portions of them.

The goal is real-time messaging to a large groups of computers/devices.  Past 
4000, the problem is distributed to a second, third, etc... server.

-- 
Andrew Chernow
eSilo, LLC
every bit counts
http://www.esilo.com/


Re: Listen / Notify rewrite

From
Tom Lane
Date:
Andrew Chernow <ac@esilo.com> writes:
> +  * This function is executed for every notification found in the queue in order
> +  * to check if the current backend is listening on that channel. Not sure if we
> +  * should further optimize this, for example convert to a sorted array and
> +  * allow binary search on it...

> I think a bsearch would be needed.  Very busy servers that make heavy use of 
> notifies would be quite a penalty.

Premature optimization is the root of all evil ;-).  Unless you've done
some profiling and can show that this is a hot spot, making it more
complicated isn't the thing to be doing now.
        regards, tom lane


Re: Listen / Notify rewrite

From
Andrew Gierth
Date:
>>>>> "Martijn" == Martijn van Oosterhout <kleptog@svana.org> writes:
>> Hi,>> >> Attached is a patch for a new listen/notify implementation.>> >> In a few words, the patch reimplements
listen/notifyas an>> slru-based queue which works similar to the sinval>> structure. Essentially it is a ring buffer on
diskwith pages>> mapped into shared memory for read/write access.
 
Martijn> While I can't really comment on the implementation, from yourMartijn> description it looks like a big
improvement.

Does it cope with the case where a trigger is doing NOTIFY, and you do
a whole-table update, therefore dumping potentially millions of
notifications in at once?

(for example a rare maintenance operation on a table which has a
listen/notify arrangement triggered by single inserts or updates)

The existing implementation copes with that just fine.

-- 
Andrew (irc:RhodiumToad)


Re: Listen / Notify rewrite

From
Andrew Chernow
Date:
> Premature optimization is the root of all evil ;-).  Unless you've done
> some profiling and can show that this is a hot spot, making it more
> complicated isn't the thing to be doing now.
> 

I'm thinking of how our system uses/abuses notifies, and began wondering 
if several thousand backends listening with a large queue would perform 
decently behind a linear search.  At this point, I have no data either 
way; only an assumption based off being burnt by sequential scans in the 
past ;)

-- 
Andrew Chernow
eSilo, LLC
every bit counts
http://www.esilo.com/


Re: Listen / Notify rewrite

From
Merlin Moncure
Date:
On Wed, Nov 11, 2009 at 5:48 PM, A.M. <agentm@themactionfaction.com> wrote:
> At least with this new payload, I can set the payload to the transaction ID
> and be certain that all the notifications I sent are processed (and in order
> even!) but could you explain why the coalescing is still necessary?

Christmas comes early this year! :-).

three reasons:
*) it works that way now...a lot of people use this feature for all
kinds of subtle things and the behavior chould change as little as
possible
*) legacy issues aside, I think it's generally better behavior (how
many times do you need to be tapped on the shoulder?)
*) since you can trivially differentiate it (using xid, sequence,
etc), what's the fuss?

merlin


Re: Listen / Notify rewrite

From
A.M.
Date:
On Nov 11, 2009, at 9:28 PM, Merlin Moncure wrote:

> On Wed, Nov 11, 2009 at 5:48 PM, A.M.  
> <agentm@themactionfaction.com> wrote:
>> At least with this new payload, I can set the payload to the  
>> transaction ID
>> and be certain that all the notifications I sent are processed  
>> (and in order
>> even!) but could you explain why the coalescing is still necessary?
>
> Christmas comes early this year! :-).
>
> three reasons:
> *) it works that way now...a lot of people use this feature for all
> kinds of subtle things and the behavior chould change as little as
> possible
> *) legacy issues aside, I think it's generally better behavior (how
> many times do you need to be tapped on the shoulder?)
> *) since you can trivially differentiate it (using xid, sequence,
> etc), what's the fuss?

Except for the fact that the number of times a notification occurred  
may be valuable information.

I thought of a compromise: add the number of times a notification was  
generated (coalesced count+1) to the callback data. That would  
satisfy any backwards compatibility concerns and my use case too!

Cheers,
M


Re: Listen / Notify rewrite

From
Tom Lane
Date:
Joachim Wieland <joe@mcknight.de> writes:
> However, if for some reason we cannot write to the slru files in the pg_notify/
> directory we might want to roll back the current transaction but with the
> proposed patch we cannot because we have already committed...

I think this is a deal-breaker, and arguing about how the pg_notify
directory ought to be writable is not even slightly acceptable --- out
of disk space is an obvious risk.  The existing implementation
guarantees that notifications obey ACID: if you commit, they are sent,
and if you don't, they aren't.  You can't just put in something that
works most of the time instead.

> One possible solution would be to write to the queue before committing
> and adding the TransactionID.  Then other backends can check if our
> TransactionID has successfully committed or not. Not sure if this is
> worth the overhead however...

That sounds reasonable, and it's certainly no more overhead than the
tuple visibility checks that happen in the current implementation.

> 2. The payload parameter is optional. A notifying client can either call
> "NOTIFY foo;" or "NOTIFY foo 'payload';". The length of the payload is
> currently limited to 128 characters... Not sure if we should allow longer
> payload strings...

Might be a good idea to make the max the same as the max length for
prepared transaction GUIDs?  Not sure anyone would be shipping those
around, but it's a pre-existing limit of about the same size.
        regards, tom lane


Re: Listen / Notify rewrite

From
Andrew Chernow
Date:
> 
> I thought of a compromise: add the number of times a notification was 
> generated (coalesced count+1) to the callback data. That would satisfy 
> any backwards compatibility concerns and my use case too!
> 

If you are suggesting that the server poke data into the notifier's opaque 
payload, I vote no.  Maybe the NOTIFY command can include a switch to enable 
this behavior.  No syntax suggestions at this point.

-- 
Andrew Chernow
eSilo, LLC
every bit counts
http://www.esilo.com/


Re: Listen / Notify rewrite

From
Tom Lane
Date:
Andrew Chernow <ac@esilo.com> writes:
>> I thought of a compromise: add the number of times a notification was 
>> generated (coalesced count+1) to the callback data. That would satisfy 
>> any backwards compatibility concerns and my use case too!

> If you are suggesting that the server poke data into the notifier's opaque 
> payload, I vote no.  Maybe the NOTIFY command can include a switch to enable 
> this behavior.  No syntax suggestions at this point.

I agree, we should not have the system modifying the payload string for
this.  And don't bother suggesting a third column in the result ---
we'd have to change the FE/BE protocol for that, and it's not going
to happen.

The existing precedent is that the system collapses identical
notifications without payloads.  So we could possibly get away with
saying that identical payload-less notifies are collapsed but those
with a payload are not.  That doesn't really seem to satisfy the
POLA though.  I think Joachim's definition is fine, and anyone who
needs delivery of distinct notifications can easily make his payload
strings unique to ensure it.
        regards, tom lane


Re: Listen / Notify rewrite

From
Andrew Chernow
Date:
>> 2. The payload parameter is optional. A notifying client can either call
>> "NOTIFY foo;" or "NOTIFY foo 'payload';". The length of the payload is
>> currently limited to 128 characters... Not sure if we should allow longer
>> payload strings...
> 
> Might be a good idea to make the max the same as the max length for
> prepared transaction GUIDs?  Not sure anyone would be shipping those
> around, but it's a pre-existing limit of about the same size.
> 

I don't see any reason to impose such a small limit on payload size.  Surely 
some limit must exist, but 128 characters seems awfully small.  I already have 
at few places in mind that would require more bytes.

Why not 8K, 64K, 256K, 1M or even more?  Is there some other factor in play 
forcing this limitation?

-- 
Andrew Chernow
eSilo, LLC
every bit counts
http://www.esilo.com/


Re: Listen / Notify rewrite

From
"A.M."
Date:
On Nov 11, 2009, at 10:43 PM, Tom Lane wrote:

> Andrew Chernow <ac@esilo.com> writes:
>>> I thought of a compromise: add the number of times a notification  
>>> was
>>> generated (coalesced count+1) to the callback data. That would  
>>> satisfy
>>> any backwards compatibility concerns and my use case too!
>
>> If you are suggesting that the server poke data into the  
>> notifier's opaque
>> payload, I vote no.  Maybe the NOTIFY command can include a switch  
>> to enable
>> this behavior.  No syntax suggestions at this point.
>
> I agree, we should not have the system modifying the payload string  
> for
> this.  And don't bother suggesting a third column in the result ---
> we'd have to change the FE/BE protocol for that, and it's not going
> to happen.
>
> The existing precedent is that the system collapses identical
> notifications without payloads.  So we could possibly get away with
> saying that identical payload-less notifies are collapsed but those
> with a payload are not.  That doesn't really seem to satisfy the
> POLA though.  I think Joachim's definition is fine, and anyone who
> needs delivery of distinct notifications can easily make his payload
> strings unique to ensure it.

The notification count could be a secondary "payload" which does not  
affect the first, but I guess I'm the only one complaining about the  
coalescing...

-M


Re: Listen / Notify rewrite

From
Joachim Wieland
Date:
On Thu, Nov 12, 2009 at 2:12 AM, Andrew Gierth
<andrew@tao11.riddles.org.uk> wrote:
> Does it cope with the case where a trigger is doing NOTIFY, and you do
> a whole-table update, therefore dumping potentially millions of
> notifications in at once?
>
> (for example a rare maintenance operation on a table which has a
> listen/notify arrangement triggered by single inserts or updates)
>
> The existing implementation copes with that just fine.

As long as you use the existing way to send out notifications by just
sending "NOTIFY foo", then yes, it will cope with millions of them
just fine because it will collapse them into one notification as does
the current implementation (contrary to the current implementation a
notification will be received from every transaction that has sent one
- the current implementation even collapses notifications from
different transactions).

However if you decide to use the new syntax and add a distinct payload
to every NOTIFY, then you also send out millions of notifications...

With the proposed patch the queue has space for up to about 81,787,680
notifications. The limits are mainly imposed by slru.c which defines
page size, pages per segment and the number of segments available. We
could easily bump that up to a multiple of the current limits if we
have decided that we need to...


Joachim


Re: Listen / Notify rewrite

From
Joachim Wieland
Date:
On Thu, Nov 12, 2009 at 4:25 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> One possible solution would be to write to the queue before committing
>> and adding the TransactionID.  Then other backends can check if our
>> TransactionID has successfully committed or not. Not sure if this is
>> worth the overhead however...
>
> That sounds reasonable, and it's certainly no more overhead than the
> tuple visibility checks that happen in the current implementation.

I am not too concerned about the runtime of the visibility checks,
instead I suppose that most of the overhead will come from waiting for
another transaction to either commit or abort...

If transaction t1 scans the queue and at some point finds
notifications from t2, then it will ask for the status of t2. If it
finds out that t2 is still running, then it has no other option than
to stop working on the queue and wait (it will be signalled again
later once t2 has finished).

This also means that we cannot at the same time write notifications to
the queue and read from it and if a transaction places a few million
notifications into the queue, readers need to wait until it has
finished and only after that they can continue and read the
notifications...

And it means that if the queue is full, we might run into a
deadlock... A transaction adding notifications will wait for the
readers to proceed and the readers wait for the transaction to commit
or abort...

One option could be to write new notifications to a subdirectory, and
create a bunch of new segment files there. Once this is done, the
segment files could be moved over and renamed, so that they continue
the slru queue... If we run out of disk space while filling that
temporary subdirectory, then we can just delete the subdirectory and
nobody has been blocked. We could still run into errors moving and
renaming the segment files (e.g. permission problems) so that we still
might need to abort the transaction...


>> 2. The payload parameter is optional. A notifying client can either call
>> "NOTIFY foo;" or "NOTIFY foo 'payload';". The length of the payload is
>> currently limited to 128 characters... Not sure if we should allow longer
>> payload strings...
>
> Might be a good idea to make the max the same as the max length for
> prepared transaction GUIDs?  Not sure anyone would be shipping those
> around, but it's a pre-existing limit of about the same size.

Yes, sounds reasonable to have the same limit for user-defined identifiers...


Joachim


Re: Listen / Notify rewrite

From
Andrew Chernow
Date:
>>> 2. The payload parameter is optional. A notifying client can either call
>>> "NOTIFY foo;" or "NOTIFY foo 'payload';". The length of the payload is
>>> currently limited to 128 characters... Not sure if we should allow longer
>>> payload strings...
>> Might be a good idea to make the max the same as the max length for
>> prepared transaction GUIDs?  Not sure anyone would be shipping those
>> around, but it's a pre-existing limit of about the same size.
> 
> Yes, sounds reasonable to have the same limit for user-defined identifiers...
> 

[..begging..] Can this be increased significantly?  I don't get it, is there any 
technical reason to make the limit soo small?  This drastically reduces the 
usefulness of the payload.  I've wanted this feature for quite sometime and it 
is quite disappointing that I could not even use it because it is unjustifiably 
limited.

One use case I need is making the payload an absolute path, which saves us a 
round trip (commonly internet latency) and a query in a section of the system 
that's extremely performance sensitive.  That sure ain't going to fit in 128 
bytes.

I'm sure I'm not the only one who finds this limit too small.  I can almost 
guarentee complaints would come in if released that way.

-- 
Andrew Chernow
eSilo, LLC
every bit counts
http://www.esilo.com/


Re: Listen / Notify rewrite

From
Merlin Moncure
Date:
On Thu, Nov 12, 2009 at 8:25 AM, Andrew Chernow <ac@esilo.com> wrote:
>
>>>> 2. The payload parameter is optional. A notifying client can either call
>>>> "NOTIFY foo;" or "NOTIFY foo 'payload';". The length of the payload is
>>>> currently limited to 128 characters... Not sure if we should allow
>>>> longer
>>>> payload strings...
>>>
>>> Might be a good idea to make the max the same as the max length for
>>> prepared transaction GUIDs?  Not sure anyone would be shipping those
>>> around, but it's a pre-existing limit of about the same size.
>>
>> Yes, sounds reasonable to have the same limit for user-defined
>> identifiers...
>>
>
> [..begging..] Can this be increased significantly?  I don't get it, is there
> any technical reason to make the limit soo small?  This drastically reduces
> the usefulness of the payload.  I've wanted this feature for quite sometime
> and it is quite disappointing that I could not even use it because it is
> unjustifiably limited.

+1

What advantage is there in limiting it to a tiny size?  This is a
'payload' after all...an arbitrary data block. Looking at the patch I
noticed the payload structure (AsyncQueueEntry) is fixed length and
designed to lay into QUEUE_PAGESIZE (set to) BLCKSZ sized pages.

Couple of questions:

*) is BLCKSZ a hard requirement, that is, coming from the slru
implementation, or can QUEUE_PAGESIZE be bumped independently of block
size.

*) why not make the AsyncQueueEntry divide evenly into BLCKSZ, that
is, make the whole structure a size that is a multiple of two?  (this
would make the payload length 'weird')

*) is there any downside you see to making the AsyncQueueEntry
structure exactly BLCKSZ bytes in size?  Are we worried about the
downsides of spinning the notifications out to disk?

*) Is a variable length AsyncQueueEntry possible? (presumably bounded
by the max page size).  Or does complicate the implementation too
much?

merlin





merlin


Re: Listen / Notify rewrite

From
Andrew Chernow
Date:
> 
> What advantage is there in limiting it to a tiny size?  This is a
> 'payload' after all...an arbitrary data block. Looking at the patch I
> noticed the payload structure (AsyncQueueEntry) is fixed length and
> designed to lay into QUEUE_PAGESIZE (set to) BLCKSZ sized pages.
> 

Hmmmm.  Looks like the limitation comes from slru.  The true payload 
limit is (8K - struct members) the way this is implemented.

-- 
Andrew Chernow
eSilo, LLC
every bit counts
http://www.esilo.com/


Re: Listen / Notify rewrite

From
Greg Stark
Date:
On Thu, Nov 12, 2009 at 3:09 PM, Andrew Chernow <ac@esilo.com> wrote:
>>
>> What advantage is there in limiting it to a tiny size?  This is a
>> 'payload' after all...an arbitrary data block. Looking at the patch I
>> noticed the payload structure (AsyncQueueEntry) is fixed length and
>> designed to lay into QUEUE_PAGESIZE (set to) BLCKSZ sized pages.
>>
>
> Hmmmm.  Looks like the limitation comes from slru.  The true payload limit
> is (8K - struct members) the way this is implemented.

So I'm beginning to think that adding a "payload" to NOTIFY is a bad
idea altogether.

The original use case NOTIFY was designed to handle was to implement
condition variables. You have clients wait on some data structure and
whenever someone changes that data structure they notify all waiting
clients to reread the data structure and check if their condition is
true. The condition might be that the data structure doesn't match
their locally cached copy and they have to rebuild their cache, it
might be that some work queue is non-empty, etc.

So the way to build a queue would be to use a table to hold your work
queue and then use NOTIFY whenever you insert into the queue and any
idle workers would know to go dequeue something from the queue.

It sounds like what people are trying to do is use NOTIFY as a queue
directly. The problem with this is that it was specifically not
designed to do this. As evidenced by the fact that it can't store
arbitrary data structures (nothing else in postgres is limited to only
handling text data, this would be an annoying arbitrary limitation),
can't handle even moderately large data efficiently, etc.

I'm beginning to think the right solution is to reject the idea of
adding a payload to the NOTIFY mechanism and instead provide a queue
contrib module which provides the interface people want.


-- 
greg


Re: Listen / Notify rewrite

From
Joachim Wieland
Date:
On Thu, Nov 12, 2009 at 3:30 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> Couple of questions:
>
> *) is BLCKSZ a hard requirement, that is, coming from the slru
> implementation, or can QUEUE_PAGESIZE be bumped independently of block
> size.

It's the size of slru's pages.

> *) why not make the AsyncQueueEntry divide evenly into BLCKSZ, that
> is, make the whole structure a size that is a multiple of two?  (this
> would make the payload length 'weird')

it's possible and yes, it would make the payload length weird. Also if
we wanted to add further structure members later on (commit time, xid,
notify counter, whatever) we'd have to announce that payload in a new
release cannot have the same length as in a previous one, so I tried
to keep both independent right from the beginning.

> *) is there any downside you see to making the AsyncQueueEntry
> structure exactly BLCKSZ bytes in size?  Are we worried about the
> downsides of spinning the notifications out to disk?

Yes, and also we'd need a lot more page slots to work efficiently and
I fear quite some queue managing overhead (locking). Currently we are
using 4 page slots and can have ~160 notifications mapped into memory.

> *) Is a variable length AsyncQueueEntry possible? (presumably bounded
> by the max page size).  Or does complicate the implementation too
> much?

That's an option to look at, I think it's technically possible, we'd
just need to throw in some more logic but I didnt want to invest too
much effort before we have a clear way to go.

However I share Greg's concerns that people are trying to use NOTIFY
as a message queue which it is not designed to be.


Joachim


Re: Listen / Notify rewrite

From
Bernd Helmle
Date:

--On 12. November 2009 15:48:44 +0000 Greg Stark <gsstark@mit.edu> wrote:

> I'm beginning to think the right solution is to reject the idea of
> adding a payload to the NOTIFY mechanism and instead provide a queue
> contrib module which provides the interface people want.

Isn't PgQ already the solution you have in mind?

-- 
Thanks
Bernd


Re: Listen / Notify rewrite

From
Andrew Chernow
Date:
> However I share Greg's concerns that people are trying to use NOTIFY
> as a message queue which it is not designed to be.

When you have an established libpq connection waiting for notifies it is 
not unreasonable to expect/desire a payload.  ISTM, the problem is that 
the initial design was half-baked.  NOTIFY is event-driven, ie. no 
polling!

-- 
Andrew Chernow
eSilo, LLC
every bit counts
http://www.esilo.com/


Re: Listen / Notify rewrite

From
Tom Lane
Date:
Joachim Wieland <joe@mcknight.de> writes:
> However I share Greg's concerns that people are trying to use NOTIFY
> as a message queue which it is not designed to be.

Yes.  Particularly those complaining that they want to have very large
payload strings --- that's pretty much a dead giveaway that it's not
being used as a condition signal.

Now you might say that yeah, that's the point, we're trying to enable
using NOTIFY in a different style.  The problem is that if you are
trying to use NOTIFY as a queue, you will soon realize that it has
the wrong semantics for that --- in particular, losing notifies across
a system crash or client crash is OK for a condition notification,
not so OK for a message queue.  The difference is that the former style
assumes that the authoritative data is in a table somewhere, so you can
still find out what you need to know after reconnecting.  If you are
doing messaging you are likely to think that you don't need any backing
store for the system state.

So while a payload string for NOTIFY has been on the to-do list since
forever, I have to think that Greg's got a good point questioning
whether it is actually a good idea.
        regards, tom lane


Re: Listen / Notify rewrite

From
Dave Page
Date:
On Thu, Nov 12, 2009 at 4:30 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> So while a payload string for NOTIFY has been on the to-do list since
> forever, I have to think that Greg's got a good point questioning
> whether it is actually a good idea.

Here's an example of why I'd like a payload (and not a queue in an
add-on module). Say you have multiple users running pgAdmin. One of
them creates a new table. Currently, unless the other user refreshes
his view of the database, he won't see it. I'd like to be able to
notify the other pgAdmin sessions that a new table has been created,
so they can automatically display it, without having to poll pg_class
or be manually refreshed. The payload could contain details of type of
object that has been created, and it's OID/identifier to minimise the
work required of the other sessions to find and display the new
object.

And as I'm sure you're already thinking it, yes, I know it doesn't
help if the new table is created using psql, but there are lots of
shops where pgAdmin is the default tool, and it could help them and
just exhibit the current behaviour if someone does break out psql.


-- 
Dave Page
EnterpriseDB UK: http://www.enterprisedb.com


Re: Listen / Notify rewrite

From
Robert Haas
Date:
On Thu, Nov 12, 2009 at 11:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Joachim Wieland <joe@mcknight.de> writes:
>> However I share Greg's concerns that people are trying to use NOTIFY
>> as a message queue which it is not designed to be.
>
> Yes.  Particularly those complaining that they want to have very large
> payload strings --- that's pretty much a dead giveaway that it's not
> being used as a condition signal.
>
> Now you might say that yeah, that's the point, we're trying to enable
> using NOTIFY in a different style.  The problem is that if you are
> trying to use NOTIFY as a queue, you will soon realize that it has
> the wrong semantics for that --- in particular, losing notifies across
> a system crash or client crash is OK for a condition notification,
> not so OK for a message queue.  The difference is that the former style
> assumes that the authoritative data is in a table somewhere, so you can
> still find out what you need to know after reconnecting.  If you are
> doing messaging you are likely to think that you don't need any backing
> store for the system state.
>
> So while a payload string for NOTIFY has been on the to-do list since
> forever, I have to think that Greg's got a good point questioning
> whether it is actually a good idea.

I think there could be cases where the person writing the code can
know, extrinsic to the system, that lost notifications are OK, and
still want to deliver a payload.  But I think the idea of enabling a
huge payload is not wise, as it sounds like it will sacrifice
performance for a feature that is by definition not essential to
anyone who is using this now.  A small payload seems like a reasonable
compromise.

...Robert


Re: Listen / Notify rewrite

From
Merlin Moncure
Date:
On Thu, Nov 12, 2009 at 11:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Joachim Wieland <joe@mcknight.de> writes:
>> However I share Greg's concerns that people are trying to use NOTIFY
>> as a message queue which it is not designed to be.
>
> Yes.  Particularly those complaining that they want to have very large
> payload strings --- that's pretty much a dead giveaway that it's not
> being used as a condition signal.
>
> Now you might say that yeah, that's the point, we're trying to enable
> using NOTIFY in a different style.  The problem is that if you are
> trying to use NOTIFY as a queue, you will soon realize that it has
> the wrong semantics for that --- in particular, losing notifies across
> a system crash or client crash is OK for a condition notification,
> not so OK for a message queue.  The difference is that the former style
> assumes that the authoritative data is in a table somewhere, so you can
> still find out what you need to know after reconnecting.  If you are
> doing messaging you are likely to think that you don't need any backing
> store for the system state.
>
> So while a payload string for NOTIFY has been on the to-do list since
> forever, I have to think that Greg's got a good point questioning
> whether it is actually a good idea.

You guys are assuming it's being used in a queue, which is only a
fraction of cases where this feature is useful.  In fact, having a
payload can remove the need for a queue completely where is currently
required for no other reason to deliver payload messages.

I'm sorry, the 128 character limit is simply lame (other than for
unsolvable implementation/performance complexity which I doubt is the
case here), and if that constraint is put in by the implementation,
than the implementation is busted and should be reworked until it's
right.  A feature that is being used for things not intended is a sign
of a strong feature, not a weak one, and the idea that a payload
should be length limited in order to prevent use in ways that are
'wrong' is a very weak argument IMO.  People have been asking for this
feature since the beginning of time, and nobody said: 'please limit it
to 128 bytes'. A limit of 4k - 64k is much more appropriate if you
even want a hard limit at all...

merlin


Re: Listen / Notify rewrite

From
Merlin Moncure
Date:
On Thu, Nov 12, 2009 at 11:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Nov 12, 2009 at 11:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Joachim Wieland <joe@mcknight.de> writes:
>>> However I share Greg's concerns that people are trying to use NOTIFY
>>> as a message queue which it is not designed to be.
>>
>> Yes.  Particularly those complaining that they want to have very large
>> payload strings --- that's pretty much a dead giveaway that it's not
>> being used as a condition signal.
>>
>> Now you might say that yeah, that's the point, we're trying to enable
>> using NOTIFY in a different style.  The problem is that if you are
>> trying to use NOTIFY as a queue, you will soon realize that it has
>> the wrong semantics for that --- in particular, losing notifies across
>> a system crash or client crash is OK for a condition notification,
>> not so OK for a message queue.  The difference is that the former style
>> assumes that the authoritative data is in a table somewhere, so you can
>> still find out what you need to know after reconnecting.  If you are
>> doing messaging you are likely to think that you don't need any backing
>> store for the system state.
>>
>> So while a payload string for NOTIFY has been on the to-do list since
>> forever, I have to think that Greg's got a good point questioning
>> whether it is actually a good idea.
>
> I think there could be cases where the person writing the code can
> know, extrinsic to the system, that lost notifications are OK, and
> still want to deliver a payload.  But I think the idea of enabling a
> huge payload is not wise, as it sounds like it will sacrifice
> performance for a feature that is by definition not essential to

'premature optimization is the root of all evil' :-)

merlin


Re: Listen / Notify rewrite

From
Andrew Chernow
Date:
> Now you might say that yeah, that's the point, we're trying to enable
> using NOTIFY in a different style.  The problem is that if you are
> trying to use NOTIFY as a queue, you will soon realize that it has
> the wrong semantics for that --- in particular, losing notifies across
> a system crash or client crash is OK for a condition notification,
> not so OK for a message queue.  The difference is that the former style
> assumes that the authoritative data is in a table somewhere, so you can
> still find out what you need to know after reconnecting.  If you are
> doing messaging you are likely to think that you don't need any backing
> store for the system state.
> 

I simply don't agree that the semantics have to change.  You call it a 
queue, I call it sesison data.  There is no reason why the documentation 
can't state that notifies may not be delivered due to crashes, so make 
sure to use persistent storage for any payload worth keeping post-session.

-- 
Andrew Chernow
eSilo, LLC
every bit counts
http://www.esilo.com/


Re: Listen / Notify rewrite

From
Merlin Moncure
Date:
On Thu, Nov 12, 2009 at 11:40 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> I'm sorry, the 128 character limit is simply lame (other than for
> unsolvable implementation/performance complexity which I doubt is the
> case here), and if that constraint is put in by the implementation,
> than the implementation is busted and should be reworked until it's
> right.

After some reflection, I realized this was an overly strong statement
and impolite to the OP.  It's easy to yarp from the gallery with the
other peanuts :-).  It's not the implementation I have an issue with,
just the _idea_ that we should be restricted to small payloads for
religious reasons...until that came upI was already scheming on how to
both extend the patch to be more flexible in terms of payload size,
and to backpatch and test it on 8.4  (no point if the community has no
interest however).  In any event, sorry for the strong words.

merlin


Re: Listen / Notify rewrite

From
"Greg Sabino Mullane"
Date:
-----BEGIN PGP SIGNED MESSAGE-----                       
Hash: RIPEMD160                                          

Tom Lane writes:

> Yes.  Particularly those complaining that they want to have very large
> payload strings --- that's pretty much a dead giveaway that it's not  
> being used as a condition signal.                                     

I don't want "large" but I do think an arbitrary limit of 128 is odd without 
some justfication. I could do 128, but would probably be happier with a bit  
more room.                                                                   

> Now you might say that yeah, that's the point, we're trying to enable
> using NOTIFY in a different style.  The problem is that if you are   
> trying to use NOTIFY as a queue, you will soon realize that it has   
> the wrong semantics for that --- in particular, losing notifies across
> a system crash or client crash is OK for a condition notification,    
> not so OK for a message queue.  The difference is that the former style
> assumes that the authoritative data is in a table somewhere, so you can
> still find out what you need to know after reconnecting.  If you are   
> doing messaging you are likely to think that you don't need any backing
> store for the system state.                                            

That's putting an awful lot of assumptions on what people are going to do
in the future, and also not being imaginative enough for circumstances
in which a payload which is not system crash survivable is a completely
acceptable condition. In most of my use cases, even desired. People wanting
a real queue can continue to use something other than NOTIFY.

> So while a payload string for NOTIFY has been on the to-do list since
> forever, I have to think that Greg's got a good point questioning
> whether it is actually a good idea.

Absolutely is a good idea.

Agent M asks:

> The notification count could be a secondary "payload" which does not
> affect the first, but I guess I'm the only one complaining about the
> coalescing...

Er...this thread is only a few hours old. For the record, I'm fine with the
coalescing as we do now (at least as far as my own selfish purposes :)

- --
Greg Sabino Mullane greg@turnstep.com
PGP Key: 0x14964AC8 200911121836
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
-----BEGIN PGP SIGNATURE-----

iEYEAREDAAYFAkr8nCwACgkQvJuQZxSWSsjQIgCgjH60LlZYEek9FwcD+/w4IHYQ
PWwAnR0YxdSBm5iBa+G+T1VpIP4qjJsx
=Ju0P
-----END PGP SIGNATURE-----




Re: Listen / Notify rewrite

From
Josh Berkus
Date:
On 11/12/09 8:30 AM, Tom Lane wrote:
> So while a payload string for NOTIFY has been on the to-do list since
> forever, I have to think that Greg's got a good point questioning
> whether it is actually a good idea.

Sure, people will abuse it as a queue.  But people abuse arrays when
they should be using child tables, use composite types to make data
non-atomic, and use dblink when they really should be using schema.
Does the potential for misuse mean that we should drop the features?  No.

Payloads are also quite useful even in a lossy environment, where you
understand that LISTEN is not a queue.  For example, I'd like to be
using LISTEN/NOTIFY for cache invalidation for some applications; if it
misses a few, or double-counts them, it's not an issue.  However, I'd
like to be able to send message like players_updated|45321 where 45321
is the ID of the player updated.

--Josh Berkus



Re: Listen / Notify rewrite

From
Robert Haas
Date:
On Thu, Nov 12, 2009 at 8:44 PM, Josh Berkus <josh@agliodbs.com> wrote:
> On 11/12/09 8:30 AM, Tom Lane wrote:
>> So while a payload string for NOTIFY has been on the to-do list since
>> forever, I have to think that Greg's got a good point questioning
>> whether it is actually a good idea.
>
> Sure, people will abuse it as a queue.  But people abuse arrays when
> they should be using child tables, use composite types to make data
> non-atomic, and use dblink when they really should be using schema.
> Does the potential for misuse mean that we should drop the features?  No.

I agree.  We frequently reject features on the basis that someone
might do something stupid with them.  It's lame and counterproductive,
and we should stop.  The world contains infinite amounts of lameness,
but that's the world's problem, not ours.  There is zero evidence that
this feature is only useful for stupid purposes, and some evidence
(namely, the opinions of esteemed community members) that it is useful
for at least some non-stupid purposes.

...Robert


Re: Listen / Notify rewrite

From
Andrew Chernow
Date:
> and we should stop.  The world contains infinite amounts of lameness,
> but that's the world's problem, not ours.  There is zero evidence that

+1

> this feature is only useful for stupid purposes, and some evidence
> (namely, the opinions of esteemed community members) that it is useful
> for at least some non-stupid purposes.

The unexpected application of a feature can be creative or innovative, which I 
firmly believe is something this community embraces.  How many ways can a screw 
driver be used ... think MacGyver :)  Deteriming whether it's creative vs. 
stupid would require an understanding of the context in which it was applied. 
For example, using our screw driver to remove a splinter would be rather stupid, 
IMHO ;)

-- 
Andrew Chernow
eSilo, LLC
every bit counts
http://www.esilo.com/


Re: Listen / Notify rewrite

From
Steve Atkins
Date:
On Nov 12, 2009, at 5:57 PM, Robert Haas wrote:

> On Thu, Nov 12, 2009 at 8:44 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> On 11/12/09 8:30 AM, Tom Lane wrote:
>>> So while a payload string for NOTIFY has been on the to-do list since
>>> forever, I have to think that Greg's got a good point questioning
>>> whether it is actually a good idea.
>>
>> Sure, people will abuse it as a queue.  But people abuse arrays when
>> they should be using child tables, use composite types to make data
>> non-atomic, and use dblink when they really should be using schema.
>> Does the potential for misuse mean that we should drop the features?  No.
>
> I agree.  We frequently reject features on the basis that someone
> might do something stupid with them.  It's lame and counterproductive,
> and we should stop.  The world contains infinite amounts of lameness,
> but that's the world's problem, not ours.  There is zero evidence that
> this feature is only useful for stupid purposes, and some evidence
> (namely, the opinions of esteemed community members) that it is useful
> for at least some non-stupid purposes.

Speaking as a consumer of this feature, my (mild) concern is not that by
adding functionality it will make it possible to do new things badly, it's that
it might make it harder (or impossible) to do currently supported things as well.

I like the current notify. It's a good match for handling table based
queues or for app-level-cache invalidation. Any changes that make
it less good at that would be a step backwards. The more complex
and flexible and "heavier" the changes, the more that concerns me.

(Though a small payload - I'd settle for at least an integer - would be
convenient, to allow invalidation of 20 different caches without needing
20 different LISTENs).

Cheers, Steve



Re: Listen / Notify rewrite

From
Joachim Wieland
Date:
Unfortunately with all that payload-length discussion, the other part
of my email regarding ACID compliant behavior got completely lost. I
would appreciate some input on that part also...

Thanks,
Joachim

On Thu, Nov 12, 2009 at 12:17 PM, Joachim Wieland <joe@mcknight.de> wrote:
> On Thu, Nov 12, 2009 at 4:25 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> One possible solution would be to write to the queue before committing
>>> and adding the TransactionID.  Then other backends can check if our
>>> TransactionID has successfully committed or not. Not sure if this is
>>> worth the overhead however...
>>
>> That sounds reasonable, and it's certainly no more overhead than the
>> tuple visibility checks that happen in the current implementation.
>
> I am not too concerned about the runtime of the visibility checks,
> instead I suppose that most of the overhead will come from waiting for
> another transaction to either commit or abort...
>
> If transaction t1 scans the queue and at some point finds
> notifications from t2, then it will ask for the status of t2. If it
> finds out that t2 is still running, then it has no other option than
> to stop working on the queue and wait (it will be signalled again
> later once t2 has finished).
>
> This also means that we cannot at the same time write notifications to
> the queue and read from it and if a transaction places a few million
> notifications into the queue, readers need to wait until it has
> finished and only after that they can continue and read the
> notifications...
>
> And it means that if the queue is full, we might run into a
> deadlock... A transaction adding notifications will wait for the
> readers to proceed and the readers wait for the transaction to commit
> or abort...
>
> One option could be to write new notifications to a subdirectory, and
> create a bunch of new segment files there. Once this is done, the
> segment files could be moved over and renamed, so that they continue
> the slru queue... If we run out of disk space while filling that
> temporary subdirectory, then we can just delete the subdirectory and
> nobody has been blocked. We could still run into errors moving and
> renaming the segment files (e.g. permission problems) so that we still
> might need to abort the transaction...
>
>
>>> 2. The payload parameter is optional. A notifying client can either call
>>> "NOTIFY foo;" or "NOTIFY foo 'payload';". The length of the payload is
>>> currently limited to 128 characters... Not sure if we should allow longer
>>> payload strings...
>>
>> Might be a good idea to make the max the same as the max length for
>> prepared transaction GUIDs?  Not sure anyone would be shipping those
>> around, but it's a pre-existing limit of about the same size.
>
> Yes, sounds reasonable to have the same limit for user-defined identifiers...
>
>
> Joachim
>


Re: Listen / Notify rewrite

From
Greg Stark
Date:
On Fri, Nov 13, 2009 at 1:57 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I agree.  We frequently reject features on the basis that someone
> might do something stupid with them.  It's lame and counterproductive,
> and we should stop.  The world contains infinite amounts of lameness,
> but that's the world's problem, not ours.  There is zero evidence that
> this feature is only useful for stupid purposes, and some evidence
> (namely, the opinions of esteemed community members) that it is useful
> for at least some non-stupid purposes.

This is BS. The problem is not that someone might do something stupid
with this feature. The problem is that we're making these other use
cases into requirements which will influence the design. This is a
classic "feature creep" situation and the result is normally products
which solve none of the use cases especially well.

Remember this queue has to live in shared memory which is a fixed size
resource. If you're designing a queue mechanism then you would
naturally use something like a queue or priority queue. You expect to
spill to disk and need an efficient storage mechanism. The natural
implementation of this in Postgres would be a table, not the slru. If
you're designing a condition-variable mechanism then you would
naturally use a hash table which can probably live in a single page
with a simple flag for each variable. The comment in another thread
that this mechanism should implement ACID properties just reinforces
my reaction.


--
greg


Re: Listen / Notify rewrite

From
Merlin Moncure
Date:
On Fri, Nov 13, 2009 at 5:35 AM, Greg Stark <gsstark@mit.edu> wrote:
> On Fri, Nov 13, 2009 at 1:57 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I agree.  We frequently reject features on the basis that someone
>> might do something stupid with them.  It's lame and counterproductive,
>> and we should stop.  The world contains infinite amounts of lameness,
>> but that's the world's problem, not ours.  There is zero evidence that
>> this feature is only useful for stupid purposes, and some evidence
>> (namely, the opinions of esteemed community members) that it is useful
>> for at least some non-stupid purposes.
>
> This is BS. The problem is not that someone might do something stupid
> with this feature. The problem is that we're making these other use
> cases into requirements which will influence the design. This is a
> classic "feature creep" situation and the result is normally products
> which solve none of the use cases especially well.

Removing a length restriction is feature creep?

Having an flexible payload mechanism improves on notify in the same
way that epoll improves on poll.   Yes, epoll is overdesigned, highly
overused, etc. but it does vastly improve server
handling/responsiveness in some situations.  Delivering a notification
with data saves a round trip back to the server and a transaction
which is both helpful in terms of server load and improving latency.
On top of that, I don't think saying: "hello; here's some data" is
groundbreaking in terms of network communication paradigms.

My interest in this feature is not academic, the project I'm working
on could use it with great benefit immediately. Arguments that I am
using notify for the set list of use cases improvised by the original
authors are not going to hold much water with me :-).

IMNSHO, I don't think that keeping payloads limited to a tiny size
'improves' this feature is a winnable argument.  That said, I do
appreciate simple designs and very much understand trying to keep
things simple.  So let me ask you this:

*) Are you sure that putting a variable length payload into the slru
is going to complicate things that badly in terms of implementing this
feature?  If so, how?

*) Wouldn't you agree that variable length would actually benefit
'proper' (small) payloads by allowing more of them to fit in the slru
page?

*) 8k should be enough for anybody :-) ...so if a variable length
structure can be made why not max the payload length at blcksz-hdrsz
and call it a day (yes, I am aware that extending the structure will
reduce payload maximum length)? I think this should fit quite nicely
into the OP's approach and benefits both people who use small payloads
and large ones...(I DO think spanning pages is complex and probably
unnecessary)

merlin


Re: Listen / Notify rewrite

From
Alvaro Herrera
Date:
Joachim Wieland wrote:

> 1. Instead of placing the queue into shared memory only I propose to create a
> new subdirectory pg_notify/ and make the queue slru-based, such that we do not
> risk blocking. Several people here have pointed out that blocking is a true
> no-go for a new listen/notify implementation. With an slru-based queue we have
> so much space that blocking is really unlikely even in periods with extreme
> notify bursts.
> Regarding performance, the slru-queue is not fsync-ed to disk so most activity
> would be in the OS file cache memory anyway and most backends will probably
> work on the same pages most of the time. However more locking overhead is
> required in comparison to a shared-memory-only implementation.

Note, however, that pg_subtrans is also supposed to use "the same pages
from memory most of the time" and it still is a performance bottleneck
in some cases, and enlarging NUM_SUBTRANS_BUFFERS is a huge boon.  I
think holding AsyncCtlLock in exclusive mode for every notify add (which
may involve I/O) is going to be a hard hit on scalability.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: Listen / Notify rewrite

From
Andrew Chernow
Date:
> spill to disk and need an efficient storage mechanism. The natural
> implementation of this in Postgres would be a table, not the slru. If

This is what I think the people's real problem is, the implementation becomes a 
more complex when including payloads (larger ones even more so).  I think its a 
side-track to discuss queue vs condition variables.  Whether a notify is 20 
bytes through the network or 8192 bytes doesn't change its design or purpose, 
only its size.

Calling this a creeping feature is quite a leap.

-- 
Andrew Chernow
eSilo, LLC
every bit counts
http://www.esilo.com/


Re: Listen / Notify rewrite

From
Greg Stark
Date:
On Fri, Nov 13, 2009 at 1:47 PM, Andrew Chernow <ac@esilo.com> wrote:
> This is what I think the people's real problem is, the implementation
> becomes a more complex when including payloads (larger ones even more so).
>  I think its a side-track to discuss queue vs condition variables.  Whether
> a notify is 20 bytes through the network or 8192 bytes doesn't change its
> design or purpose, only its size.
>
> Calling this a creeping feature is quite a leap.

It's true that the real creep is having the payload at all rather than
not having it. But having fixed-size data also makes the
implementation much simpler as well.

One person described stuffing the payload with the primary key of the
record being invalidated. This means the requirements have just gone
from holding at most some small fixed number of records bounded by the
number of tables or other shared data structures to holding a large
number of records bounded only by the number of records in their
tables which is usually much much larger.

Now you're talking about making the payloads variable size, which
means you need to do free space management within shared pages to keep
track of how much space is free and available for reuse.

So we've gone from a simple hash table of fixed size entries
containing an oid or "name" datum where we expect the hash table to
fit in memory and a simple lru can handle old pages that aren't part
of the working set to something that's going to look a lot like a
database table -- it has to handle reusing space in collections of
variable size data and scale up to millions of entries.

And I note someone else in the thread was suggesting it needed ACID
properties which makes space reuse even more complex and will need
something like vacuum to implement it.

--
greg


Re: Listen / Notify rewrite

From
Andrew Chernow
Date:
>> Calling this a creeping feature is quite a leap.
> 
> It's true that the real creep is having the payload at all rather than
> not having it.

Not having the payload at all is like santa showing up without his bag 
of toys.  Instead, you have to drive/fly to the north pole where he just 
came from to get them.

> 
> One person described stuffing the payload with the primary key of the
> record being invalidated. This means the requirements have just gone
> from holding at most some small fixed number of records bounded by the
> number of tables or other shared data structures to holding a large
> number of records bounded only by the number of records in their
> tables which is usually much much larger.
> 
> Now you're talking about making the payloads variable size, which
> means you need to do free space management within shared pages to keep
> track of how much space is free and available for reuse.
> 
> So we've gone from a simple hash table of fixed size entries
> containing an oid or "name" datum where we expect the hash table to
> fit in memory and a simple lru can handle old pages that aren't part
> of the working set to something that's going to look a lot like a
> database table -- it has to handle reusing space in collections of
> variable size data and scale up to millions of entries.
> 
> And I note someone else in the thread was suggesting it needed ACID
> properties which makes space reuse even more complex and will need
> something like vacuum to implement it.
> 

I think the original OP was close.  The structure can still be fixed 
length but maybe we can bump it to 8k (BLCKSZ)?

-- 
Andrew Chernow
eSilo, LLC
every bit counts
http://www.esilo.com/


Re: Listen / Notify rewrite

From
Merlin Moncure
Date:
On Fri, Nov 13, 2009 at 10:00 AM, Andrew Chernow <ac@esilo.com> wrote:
> I think the original OP was close.  The structure can still be fixed length
> but maybe we can bump it to 8k (BLCKSZ)?

The problem with this (which I basically agree with) is that this will
greatly increase the size of the queue for all participants of this
feature if they use the payload or not.  I think it boils down to
this: is there a reasonably effective way of making the payload
variable length (now or in the future)?  If not, let's compromise and
maybe go with a larger size, maybe 256 or 512 bytes.

merlin


Re: Listen / Notify rewrite

From
Tom Lane
Date:
Merlin Moncure <mmoncure@gmail.com> writes:
> The problem with this (which I basically agree with) is that this will
> greatly increase the size of the queue for all participants of this
> feature if they use the payload or not.  I think it boils down to
> this: is there a reasonably effective way of making the payload
> variable length (now or in the future)?  If not, let's compromise and
> maybe go with a larger size, maybe 256 or 512 bytes.

Yeah, if the payload is not variable length then we are not going to be
able to make it more than a couple hundred bytes without taking a
significant performance hit.  (By the way, has anyone yet tried to
compare the speed of this implementation to the old code?)
        regards, tom lane


Re: Listen / Notify rewrite

From
Andrew Dunstan
Date:

Merlin Moncure wrote:
> On Fri, Nov 13, 2009 at 10:00 AM, Andrew Chernow <ac@esilo.com> wrote:
>   
>> I think the original OP was close.  The structure can still be fixed length
>> but maybe we can bump it to 8k (BLCKSZ)?
>>     
>
> The problem with this (which I basically agree with) is that this will
> greatly increase the size of the queue for all participants of this
> feature if they use the payload or not.  I think it boils down to
> this: is there a reasonably effective way of making the payload
> variable length (now or in the future)?  If not, let's compromise and
> maybe go with a larger size, maybe 256 or 512 bytes.
>
>
>   

My original intention was to have the queue as a circular buffer where 
the size of the entries was variable, but possibly bounded. Certainly 
using fixed length records of large size seems somewhat wasteful.

But maybe that doesn't fit with what Joachim has done.

Incidentally, I'd like to thank Joachim personally for having done this 
work, that I have been trying to get to for a couple of years, and that 
circumstances kept conspiring to prevent me from doing. It's been a big 
monkey on my back.

cheers

andrew


Re: Listen / Notify rewrite

From
"Greg Sabino Mullane"
Date:
-----BEGIN PGP SIGNED MESSAGE-----                                   
Hash: RIPEMD160                                                      


> This is BS. The problem is not that someone might do something stupid
> with this feature. The problem is that we're making these other use
> cases into requirements which will influence the design. This is a
> classic "feature creep" situation and the result is normally products
> which solve none of the use cases especially well.

Feature creep? The payload has been on the roadmap for a long time. I don't
recall anyone objecting when Andrew was working on the next version of
Listen/Notify around what is probably a couple of years ago now.

> Remember this queue has to live in shared memory which is a fixed size
> resource. If you're designing a queue mechanism then you would
> naturally use something like a queue or priority queue.

Right, but we're not discussing a queue, we're discussing the listen/notify
system. If people want to mis-use it as a queue when they should be using
something else, so be it. Talk of efficiency also seems silly here - using
shared memory is already way more efficient than our current listen/notify
system.

- --
Greg Sabino Mullane greg@turnstep.com
PGP Key: 0x14964AC8 200911131234
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
-----BEGIN PGP SIGNATURE-----

iEYEAREDAAYFAkr9mL0ACgkQvJuQZxSWSshkvACg6OQ/SRjkvmozzUogTX3weuio
4ZoAnRVfvcrdMmo+iKtkyXmhAlZqViqF
=6fzv
-----END PGP SIGNATURE-----




Re: Listen / Notify rewrite

From
Tom Lane
Date:
"Greg Sabino Mullane" <greg@turnstep.com> writes:
> Talk of efficiency also seems silly here - using
> shared memory is already way more efficient than our current listen/notify
> system.

Except that the proposed implementation spills to disk.  Particularly if
it has to have support for large payloads, it could very well end up
being a lot SLOWER than what we have now.
        regards, tom lane


Re: Listen / Notify rewrite

From
Andrew Chernow
Date:
Tom Lane wrote:
> "Greg Sabino Mullane" <greg@turnstep.com> writes:
>> Talk of efficiency also seems silly here - using
>> shared memory is already way more efficient than our current listen/notify
>> system.
> 
> Except that the proposed implementation spills to disk.  Particularly if
> it has to have support for large payloads, it could very well end up
> being a lot SLOWER than what we have now.
> 

True, but do you really consider it to be a common case that the notify 
system gets soo bogged down that it starts to crawl?  The problem would 
be the collective size of notify structures + payloads and whether that 
would fit in memory or not.  This leads me to believe that the only 
safety in smaller payloads is *possibly* a smaller chance of bogging it 
down, but that all depends on the usage pattern of smaller vs. larger 
payloads which is system specific.

-- 
Andrew Chernow
eSilo, LLC
every bit counts
http://www.esilo.com/


Re: Listen / Notify rewrite

From
Andrew Chernow
Date:
> My original intention was to have the queue as a circular buffer where 
> the size of the entries was variable, but possibly bounded. Certainly 
> using fixed length records of large size seems somewhat wasteful.
> 

Maybe we should do away with 'spill to disk' all together and either 
hard-code an overflow behavior or make it a knob.  Possible overflow 
behaviors could be block until space is available, throw an error or 
silently drop it.

Can the size of the shared memory segment for notifications be 
configurable?  That would allow those with large payloads or a huge 
number of notifications to bump memory to avoid overflow cases.  By 
removing the disk and making shmem usage configurable, I think the 
notify system would be flexible and could scale nicely.

Another added benefit is the payload limit can be much higher than 
previously considered, because memory/performance concerns are now in 
the hands of the DBA.

> Incidentally, I'd like to thank Joachim personally for having done this 
> work, 

+1

-- 
Andrew Chernow
eSilo, LLC
every bit counts
http://www.esilo.com/


Re: Listen / Notify rewrite

From
James Mansion
Date:
Josh Berkus wrote:
> Payloads are also quite useful even in a lossy environment, where you
> understand that LISTEN is not a queue.  For example, I'd like to be
> using LISTEN/NOTIFY for cache invalidation for some applications; if it
> misses a few, or double-counts them, it's not an issue.  However, I'd
>   
Not sure how missing a few can not be an issue for cache invalidation, 
but never mind.

In the use-cases I've considered I've used a trigger to write a change 
notification to a table
and a notify to indicate that notification record(s) have been changed. 
The notifications
contain copies of the primary keys and the action so the cache processor 
knows what's
changed and the notify is a a wakeup signal.

If this is in fact the most common use case, perhaps an alternative 
approach would be
to automate it directly, so that writing the triggers (and using the 
trigger processing engines)
would be unecessary, so:- the queue definition can be automated with reference to the parent 
table, by DDL stating that one is required- the notification 'name' is effectively the queue name and the 
subscription says 'tell me when a change note is placed in the queue'

Doing this in the database engine core allows a number of potential 
optimisations:- the mechanism does not require general trigger execution- the queue does not have to be a real table,
andcan have custom 
 
semantics: it may not actually be necessary to store copies of the primary key data if it 
can refer to the rows so the data can be retrieved as the queue is consumed- if there are no subscribers to the queue
thenthe insertion to it can 
 
be elided- if the server crashes, connected consumers should assume caches are 
invalid and theer is no ACID requirement for the queue data- if the queue fills then slow consumer(s) can be dropped
andcan 
 
receive a data loss indicator
> like to be able to send message like players_updated|45321 where 45321
> is the ID of the player updated.
>   
Indeed.

Just a thought, anyway.  (FWIW I was initially concerned about the lack 
of payload, but
with any sort of lossy compression I figured it wasn't, actually, and I 
needed a notification
queue)


James



Re: Listen / Notify rewrite

From
Merlin Moncure
Date:
On Fri, Nov 13, 2009 at 11:08 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
 (By the way, has anyone yet tried to
> compare the speed of this implementation to the old code?)

I quickly hacked pgbench to take a custom script on connection (for
listen), and make pgbench'd 'notify x'; with all clients doing 'listen
x'.

The old method (measured on a 4 core high performance server) has
severe scaling issues due to table bloat (we knew that):
./pgbench -c 10 -t 1000 -n -b listen.sql -f notify.sql
run #1 tps = 1364.948079 (including connections establishing)
run #2 tps = 573.988495 (including connections establishing)
<vac full pg_listener>
./pgbench -c 50 -t 200 -n -b listen.sql -f notify.sql
tps = 844.033498 (including connections establishing)


new method on my dual core workstation (max payload 128):
./pgbench -c 10 -t 10000 -n -b listen.sql -f notify.sql -hlocalhost postgres
tps = 16343.012373 (including connections establishing)
./pgbench -c 20 -t 5000 -n -b listen.sql -f notify.sql -hlocalhost postgres
tps = 7642.104941 (including connections establishing)
./pgbench -c 50 -t 5000 -n -b listen.sql -f notify.sql -hlocalhost postgres
tps = 3184.049268 (including connections establishing)

max payload 2048:
./pgbench -c 10 -t 10000 -n -b listen.sql -f notify.sql -hlocalhost postgres
tps = 12062.906610 (including connections establishing)
./pgbench -c 20 -t 5000 -n -b listen.sql -f notify.sql -hlocalhost postgres
tps = 7229.505869 (including connections establishing)
./pgbench -c 50 -t 5000 -n -b listen.sql -f notify.sql -hlocalhost postgres
tps = 3219.511372 (including connections establishing)

getting sporadic 'LOG:  could not send data to client: Broken pipe'
throughout the test.

merlin


Re: Listen / Notify rewrite

From
Alex
Date:
On Thu, 12 Nov 2009 11:22:32 -0500
Andrew Chernow <ac@esilo.com> wrote:

> 
> > However I share Greg's concerns that people are trying to use NOTIFY
> > as a message queue which it is not designed to be.
> 
> When you have an established libpq connection waiting for notifies it
> is not unreasonable to expect/desire a payload.  ISTM, the problem is
> that the initial design was half-baked.  NOTIFY is event-driven, ie.
> no polling!
> 

I agree. Wouldn't it make sense to allow the user to pass libpq a
callback function which is executed when NOTIFY events happen? Currently
we are forced to poll the connection, which means that we'll be checking
for a NOTIFY every time we have new data.

That just doesn't make sense.

-- 
Alex


Re: Listen / Notify rewrite

From
Simon Riggs
Date:
On Wed, 2009-11-11 at 22:25 +0100, Joachim Wieland wrote:

>  3. Every distinct notification is delivered.

> Regarding performance, the slru-queue is not fsync-ed to disk

These two statements seem to be in opposition. How do you know a
notification will be delivered if the queue is non-recoverable? Surely
the idea is to send information externally to the database, so why
should that stream of info be altered depending upon whether the
database crashes? You couldn't use it to reliably update an external
cache for example.

Why do we need this as well as PgQ? For me, I would need a good reason
why this shouldn't be implemented using a normal table, plus bells and
whistles. If normal tables don't do what you need, perhaps that's a
place to add value.

-- Simon Riggs           www.2ndQuadrant.com



Re: Listen / Notify rewrite

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Wed, 2009-11-11 at 22:25 +0100, Joachim Wieland wrote:
>> 3. Every distinct notification is delivered.
>> Regarding performance, the slru-queue is not fsync-ed to disk

> These two statements seem to be in opposition. How do you know a
> notification will be delivered if the queue is non-recoverable?

You misunderstand the requirements.  LISTEN notifications are *not*
meant to survive a database crash, and never have been.  However,
so long as both client and server stay up, they must be reliable.
If the client has to poll database state because it might have
missed a notification, the feature is just a waste of time.
        regards, tom lane


Re: Listen / Notify rewrite

From
Simon Riggs
Date:
On Sun, 2009-11-15 at 16:48 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > On Wed, 2009-11-11 at 22:25 +0100, Joachim Wieland wrote:
> >> 3. Every distinct notification is delivered.
> >> Regarding performance, the slru-queue is not fsync-ed to disk
> 
> > These two statements seem to be in opposition. How do you know a
> > notification will be delivered if the queue is non-recoverable?
> 
> You misunderstand the requirements.  LISTEN notifications are *not*
> meant to survive a database crash, and never have been.  However,
> so long as both client and server stay up, they must be reliable.
> If the client has to poll database state because it might have
> missed a notification, the feature is just a waste of time.

Why would it be so important for messages to be reliable if the database
is up, yet its OK to lose messages if it crashes? The application must
still allow for the case that messages are lost. 

-- Simon Riggs           www.2ndQuadrant.com



Re: Listen / Notify rewrite

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Sun, 2009-11-15 at 16:48 -0500, Tom Lane wrote:
>> You misunderstand the requirements.  LISTEN notifications are *not*
>> meant to survive a database crash, and never have been.  However,
>> so long as both client and server stay up, they must be reliable.
>> If the client has to poll database state because it might have
>> missed a notification, the feature is just a waste of time.

> Why would it be so important for messages to be reliable if the database
> is up, yet its OK to lose messages if it crashes? The application must
> still allow for the case that messages are lost. 

No, that's the point.  The design center for LISTEN is that you have a
client that needs to respond to changes in the DB state.  When it first
connects it will issue LISTEN and then (order is important) it will
examine the current state of the database.  After that it can just wait
for NOTIFY to tell it that something interesting has happened.  If it
crashes, or sees a disconnect indicating that the server has crashed,
it goes back to the startup point.  No problem.  But if it can't be sure
that it will get a NOTIFY every time something happens to the DB state,
then it has to do active polling of the state instead, and the NOTIFY
feature is really worthless to it.

This is an entirely useful and reliable feature within these parameters
--- the first application I ever wrote using PG relied on NOTIFY to work
this way.  (In fact it wouldn't be overstating the case to say that
I wouldn't be a PG hacker today if it weren't for LISTEN/NOTIFY.)
        regards, tom lane


Re: Listen / Notify rewrite

From
"Greg Sabino Mullane"
Date:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: RIPEMD160


>> You misunderstand the requirements.  LISTEN notifications are *not*
>> meant to survive a database crash, and never have been.  However,
>> so long as both client and server stay up, they must be reliable.
>> If the client has to poll database state because it might have
>> missed a notification, the feature is just a waste of time.

> Why would it be so important for messages to be reliable if
> the database is up, yet its OK to lose messages if it crashes? The
> application must still allow for the case that messages are lost.

Well, there are many use cases. For example, Bucardo uses notifications
to let it know that a table has changed. If the database crashes,
Bucardo is going to restart - as part of its startup routine, it checks
all tables manually for changes, eliminating the need for the NOTIFYs
to survive the crash.

- --
Greg Sabino Mullane greg@turnstep.com
End Point Corporation
PGP Key: 0x14964AC8 200911160910
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
-----BEGIN PGP SIGNATURE-----

iEYEAREDAAYFAksBXkAACgkQvJuQZxSWSsjEWACePcT+65HQ0dvx52PjjTkdMzVS
ELMAnAhR3Ll016/EwPdizzS5BcsuXaw9
=jds6
-----END PGP SIGNATURE-----




Re: Listen / Notify rewrite

From
Joachim Wieland
Date:
On Sat, Nov 14, 2009 at 11:06 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> The old method (measured on a 4 core high performance server) has
> severe scaling issues due to table bloat (we knew that):
> ./pgbench -c 10 -t 1000 -n -b listen.sql -f notify.sql
> run #1 tps = 1364.948079 (including connections establishing)

> new method on my dual core workstation (max payload 128):
> ./pgbench -c 10 -t 10000 -n -b listen.sql -f notify.sql -hlocalhost postgres
> tps = 16343.012373 (including connections establishing)

That looks fine and is similar to my tests where I also see a
performance increase of about 10x, and unlike pg_listener it is
constant.

> getting sporadic 'LOG:  could not send data to client: Broken pipe'
> throughout the test.

This looks like the server is trying to send a notification down to
the client but the client has already terminated the connection...


Joachim


Re: Listen / Notify rewrite

From
Merlin Moncure
Date:
On Mon, Nov 16, 2009 at 4:41 PM, Joachim Wieland <joe@mcknight.de> wrote:
> On Sat, Nov 14, 2009 at 11:06 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> The old method (measured on a 4 core high performance server) has
>> severe scaling issues due to table bloat (we knew that):
>> ./pgbench -c 10 -t 1000 -n -b listen.sql -f notify.sql
>> run #1 tps = 1364.948079 (including connections establishing)
>
>> new method on my dual core workstation (max payload 128):
>> ./pgbench -c 10 -t 10000 -n -b listen.sql -f notify.sql -hlocalhost postgres
>> tps = 16343.012373 (including connections establishing)
>
> That looks fine and is similar to my tests where I also see a
> performance increase of about 10x, and unlike pg_listener it is
> constant.

old method scaled (badly) on volume of notifications and your stuff
seems to scale based on # of client's sending simultaneous
notifications.   Well, you're better all day long, but it shows that
your fears regarding locking were not completely unfounded.  Do the
Burcardo people have any insights on the #of simultaneous notifies are
generated from different backends?

merlin


Re: Listen / Notify rewrite

From
"Greg Sabino Mullane"
Date:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: RIPEMD160


> old method scaled (badly) on volume of notifications and your stuff
> seems to scale based on # of client's sending simultaneous
> notifications.   Well, you're better all day long, but it shows that
> your fears regarding locking were not completely unfounded.  Do the
> Burcardo people have any insights on the #of simultaneous notifies are
> generated from different backends?

Very low. On a busy system I know of there are about 90 entries in
pg_listener, and I would guess that the maximum rate of simulataneous
notifies is not more than 3 per second, tops. If someone knows an
easy way to measure such a thing and is really curious, I can see about
getting better numbers.

- --
Greg Sabino Mullane greg@turnstep.com
End Point Corporation
PGP Key: 0x14964AC8 200911162127
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
-----BEGIN PGP SIGNATURE-----

iEYEAREDAAYFAksCCjsACgkQvJuQZxSWSsgTogCfS5Xg8N2JhsUpi2r96IbxX+Tm
pMsAnAktBVkEblzx6Ux/netXkP9u4AVG
=SO/k
-----END PGP SIGNATURE-----