
From: Markus Wanner
Subject: Re: bg worker: patch 1 of 6 - permanent process
Date:
Msg-id: 4C780144.8080407@bluegap.ch
In response to: Re: bg worker: patch 1 of 6 - permanent process (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
Hi,

On 08/26/2010 11:57 PM, Robert Haas wrote:
> It wouldn't require you to preallocate a big chunk of shared memory

Agreed, you wouldn't have to allocate it in advance. We would still want 
a configurable upper limit. So this can be seen as another approach to 
implementing a dynamic allocator. (Which should be kept separate from 
the imessages implementation proper, if only for the sake of 
modularization, IMO.)

> In addition, it means that maximum_message_queue_size_per_backend (or
> whatever it's called) can be changed on-the-fly; that is, it can be
> PGC_SIGHUP rather than PGC_POSTMASTER.

That's certainly a point. However, as you are proposing a solution for 
just one subsystem (i.e. imessages), I don't find it half as convincing.

If you are saying it *should* be possible to resize shared memory in a 
portable way, why not do it for *all* subsystems right away? I still 
remember Tom saying it's not something that's doable in a portable way. 
Why and how should it be possible on a per-backend basis? How portable 
is mmap() really? Why don't we use it in Postgres as of now?

I certainly think that these are orthogonal issues: whether to use fixed 
boundaries or to dynamically allocate the memory available is one thing; 
dynamic resizing is another. If the latter is possible, I'm certainly not 
opposed to it. (But I would still favor dynamic allocation.)

> As to efficiency, the process is not much different once the initial
> setup is completed.

I fully agree with that.

I'm more concerned about ease of use for developers. Simply being able 
to alloc() from shared memory makes things easier than having to invent 
a separate allocation method for every subsystem, again and again (akin 
to the argument that people are more used to multi-threaded programming).
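
To make that concrete, the kind of interface I have in mind is roughly
the following (a minimal sketch; shmem_alloc/shmem_free are made-up
names, not what the patch actually exports):

#include <stddef.h>
#include <string.h>

/*
 * Hypothetical dynamic shared memory allocator, analogous to
 * malloc/free but carving chunks out of a pre-sized shared memory
 * pool.  Since all backends map the pool at the same address, a chunk
 * allocated by one backend can simply be handed to another backend,
 * which frees it when done.
 */
extern void *shmem_alloc(size_t size);  /* NULL if the pool is exhausted */
extern void  shmem_free(void *chunk);

/*
 * Example: any subsystem can build a message for another backend
 * without inventing its own allocation scheme.
 */
static char *
build_payload(const char *dbname)
{
    size_t  len = strlen(dbname) + 1;
    char   *buf = shmem_alloc(len);

    if (buf == NULL)
        return NULL;            /* caller backs off and retries later */

    memcpy(buf, dbname, len);
    return buf;                 /* pointer is valid in every backend */
}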

> Doing the extra setup just to send one or two messages
> might suck.  But maybe that just means this isn't the right mechanism
> for those cases (e.g. the existing XID-wraparound logic should still
> use signal multiplexing rather than this system).  I see the value of
> this as being primarily for streaming big chunks of data, not so much
> for sending individual, very short messages.

I agree that simple signals don't need a full imessage. But as soon as 
you want to send some data along (like which database to vacuum), or 
need a delivery guarantee (i.e. no single message gets lost, as opposed 
to signals), then imessages should be cheap enough.
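
To illustrate what I mean by "some data" (a hypothetical layout, not
literally the patch's message format):

#include <stdint.h>

/*
 * Hypothetical imessage carrying a small payload, e.g. telling a
 * worker which database to vacuum.  The point is that "signal plus a
 * little data, delivered reliably" still only costs a tiny allocation
 * and a pointer enqueue.
 */
typedef struct IMessageHdr
{
    uint32_t    msg_type;       /* what kind of message this is */
    uint32_t    payload_size;   /* number of payload bytes following */
} IMessageHdr;

typedef struct VacuumRequestMsg
{
    IMessageHdr hdr;
    uint32_t    dboid;          /* database to vacuum */
} VacuumRequestMsg;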

>> The current approach uses plain spinlocks, which are more efficient. Note
>> that both, appending as well as removing from the queue are writing
>> operations, from the point of view of the queue. So I don't think LWLocks
>> buy you anything here, either.
>
> I agree that this might not be useful.  We don't really have all the
> message types defined yet, though, so it's hard to say.

What does the type of lock used have to do with message types? IMO it 
doesn't matter what kind of message or what size you want to send. For 
appending or removing a pointer to or from a message queue, a spinlock 
seems to be just the right thing to use.
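
To spell out what I mean: the queue only links message headers together,
so appending and removing are a handful of pointer writes under the
lock, independent of message type or size. A sketch with made-up names
(pthread_spinlock_t stands in for a Postgres slock_t):

#include <pthread.h>
#include <stddef.h>

typedef struct IMessage
{
    struct IMessage *next;
    /* ... type, size and payload follow in shared memory ... */
} IMessage;

typedef struct IMessageQueue
{
    pthread_spinlock_t lock;
    IMessage   *head;           /* next message to consume */
    IMessage   *tail;           /* most recently appended message */
} IMessageQueue;

static void
queue_append(IMessageQueue *q, IMessage *msg)
{
    msg->next = NULL;
    pthread_spin_lock(&q->lock);
    if (q->tail)
        q->tail->next = msg;
    else
        q->head = msg;
    q->tail = msg;
    pthread_spin_unlock(&q->lock);
}

static IMessage *
queue_remove(IMessageQueue *q)
{
    IMessage   *msg;

    pthread_spin_lock(&q->lock);
    msg = q->head;
    if (msg)
    {
        q->head = msg->next;
        if (q->head == NULL)
            q->tail = NULL;
    }
    pthread_spin_unlock(&q->lock);
    return msg;
}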

>> I understand the need to limit the amount of data in flight, but I don't
>> think that sending any type of message should ever block. Messages are
>> atomic in that regard. Either they are ready to be delivered (in entirety)
>> or not. Thus the sender needs to hold back the message, if the recipient is
>> overloaded. (Also note that currently imessages are bound to a maximum size
>> of around 8 KB).
>
> That's functionally equivalent to blocking, isn't it?  I think that's
> just a question of what API you want to expose.

Hm.. well, yeah, depends on what level you are arguing. The imessages 
API can be used in a completely non-blocking fashion. So a process can 
theoretically do other work while waiting for messages.
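
By non-blocking I mean usage along these lines (hypothetical function
names; IMessageCheck() stands for whatever the receive-side entry point
ends up being called):

/*
 * A backend's main loop can poll for imessages and keep doing other
 * work when none has arrived; the check returns NULL immediately if
 * the queue is empty.
 */
for (;;)
{
    IMessage   *msg = IMessageCheck();

    if (msg != NULL)
        process_message(msg);   /* consume and release the message */
    else
        do_other_work();        /* or sleep until the wakeup signal */
}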

For parallel querying, the helper/worker backends would probably need to 
block if the origin backend is not ready to accept more data, yes. 
However, making them accept and process another job in the meantime seems 
hard to do. But that's not an imessages problem per se. (Whereas with the 
streaming layer I mentioned above, that would not even be possible, 
because it blocks.)

> For replication, that might be the case, but for parallel query,
> per-queue seems about right.  At any rate, no design we've discussed
> will let individual queues grow without bound.

Extend parallel querying to multiple nodes and you are back at the same 
requirement.

However, it's certainly something that can be done atop imessages. I'm 
unsure if doing it as part of imessages is a good thing or not. Given 
the above requirement, I don't currently think so. Using multiple queues 
with different priorities, as you proposed, would probably make it more 
feasible.

> You probably need this, but 8KB seems like a pretty small chunk size.

For node-internal messaging, I probably agree. Would need benchmarking, 
as it's a compromise between latency and overhead, IMO.

I've chosen 8 KB so that these messages (together with some GCS and other 
transport headers) presumably fit into ethernet jumbo frames: 8192 bytes 
of payload plus a few hundred bytes of headers still fits within a 
typical 9000 byte jumbo frame. I'd argue that you'd want even smaller 
chunk sizes for 1500 byte MTUs, because I don't expect the GCS to do a 
better job at fragmenting than we can do in the upper layer (i.e. without 
copying data and without additional latency when reassembling the 
packet). But again, maybe that should be benchmarked first.

> I think one of the advantages of a per-backend area is that you don't
> need to worry so much about fragmentation.  If you only need in-order
> message delivery, you can just use the whole thing as a big ring
> buffer.

Hm.. interesting idea. It's similar to my initial implementation, except 
that I had only a single ring-buffer for all backends.

> There's no padding or sophisticated allocation needed.  You
> just need a pointer to the last byte read (P1), the last byte allowed
> to be read (P2), and the last byte allocated (P3).  Writers take a
> spinlock, advance P3, release the spinlock, write the message, take
> the spinlock, advance P2, release the spinlock, and signal the reader.

That would block parallel writers (i.e. only one process can write to 
the queue at any time).

> Readers take the spinlock, read P1 and P2, release the spinlock, read
> the data, take the spinlock, advance P1, and release the spinlock.

It would require copying data in case a process only needs to forward 
the message. That's a quick pointer dequeue and enqueue exercise ATM.
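
Just to make sure I read your proposal correctly, here's roughly what I
understand it to be (a minimal sketch with made-up names;
pthread_spinlock_t again stands in for a slock_t, wraparound handling
and the reader wakeup signal are omitted, and as noted above only one
writer may be in the copy phase at a time):

#include <pthread.h>
#include <stdint.h>
#include <string.h>

typedef struct MsgRing
{
    pthread_spinlock_t lock;
    uint64_t    p1;             /* last byte read */
    uint64_t    p2;             /* last byte allowed to be read */
    uint64_t    p3;             /* last byte allocated */
    char        buf[65536];     /* per-backend ring space */
} MsgRing;

static void
ring_write(MsgRing *ring, const char *msg, uint64_t len)
{
    uint64_t    start;

    /* reserve space by advancing P3 */
    pthread_spin_lock(&ring->lock);
    start = ring->p3;
    ring->p3 += len;
    pthread_spin_unlock(&ring->lock);

    /* copy the message outside the lock (assumes no wraparound here) */
    memcpy(ring->buf + start, msg, len);

    /* publish: bytes up to P2 may now be read */
    pthread_spin_lock(&ring->lock);
    ring->p2 = start + len;
    pthread_spin_unlock(&ring->lock);
    /* ... signal the reader ... */
}

static uint64_t
ring_read(MsgRing *ring, char *out, uint64_t outsize)
{
    uint64_t    p1, p2, len;

    pthread_spin_lock(&ring->lock);
    p1 = ring->p1;
    p2 = ring->p2;
    pthread_spin_unlock(&ring->lock);

    len = p2 - p1;
    if (len > outsize)
        len = outsize;
    memcpy(out, ring->buf + p1, len);   /* again, no wraparound handling */

    /* mark the bytes as consumed */
    pthread_spin_lock(&ring->lock);
    ring->p1 = p1 + len;
    pthread_spin_unlock(&ring->lock);

    return len;
}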

> You might still want to fragment chunks of data to avoid problems if,
> say, two writers are streaming data to a single reader.  In that case,
> if the messages were too large compared to the amount of buffer space
> available, you might get poor utilization, or even starvation.  But I
> would think you wouldn't need to worry about that until the message
> size got fairly high.

Some of the writers in Postgres-R allocate the chunk for the message in 
shared memory way before they send the message. I.e. during a write 
operation of a transaction that needs to be replicated, the backend 
allocates space for a message at the start of the operation, but only 
fills it with change set data during processing. That can possibly take 
quite a while.

Decoupling memory allocation from message queue management allows this to 
be done without having to copy the data. The same holds true for 
forwarding a message.
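
In rough pseudo-C, the pattern is (made-up names again, reusing the
hypothetical shmem_alloc() and queue functions sketched above):

#include <stddef.h>
#include <string.h>

#define MAX_CHANGESET_SIZE 8192     /* hypothetical per-message limit */

/*
 * Allocate-early / fill-later / enqueue pattern: the message chunk is
 * allocated when the replicated write operation starts, filled with
 * change set data as the operation proceeds, and only enqueued once it
 * is complete.  The queue only ever sees a pointer, so no copying.
 */
static void
replicate_write(IMessageQueue *recipient, const char *data, size_t len)
{
    IMessage   *msg = shmem_alloc(sizeof(IMessage) + MAX_CHANGESET_SIZE);

    if (msg == NULL)
        return;                     /* real code would back off and retry */

    /* ... possibly much later: fill in the change set data ... */
    memcpy((char *) (msg + 1), data, len);

    /* ... and finally hand it over */
    queue_append(recipient, msg);
}

/* Forwarding is just dequeue plus enqueue of the same pointer. */
static void
forward_one(IMessageQueue *from, IMessageQueue *to)
{
    IMessage   *msg = queue_remove(from);

    if (msg != NULL)
        queue_append(to, msg);
}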

> Well, what I was thinking about is the fact that data messages are
> bigger.  If I'm writing a 16-byte message once a minute and the reader
> and I block each other until the message is fully read or written,
> it's not really that big of a deal.  If the same thing happens when
> we're trying to continuously stream tuple data from one process to
> another, it halves the throughput; we expect both processes to be
> reading/writing almost constantly.

Agreed. Unlike the proposed ring-buffer approach, the separate allocator 
approach doesn't have that problem, because writing itself is fully 
parallelized, even to the same recipient.

> I think unicast messaging is really useful and I really want it, but
> the requirement that it be done through dynamic shared memory
> allocations feels very uncomfortable to me (as you've no doubt
> gathered).

Well, I on the other hand am utterly uncomfortable with having a 
separate memory allocation solution per sub-system (and dynamic 
allocation definitely is a problem inherent to lots of our subsystems). 
Given the ubiquity of dynamic memory allocators, I don't really 
understand your discomfort.

Thanks for discussing, I always enjoy respectful disagreement.

Regards

Markus Wanner

