Re: bg worker: patch 1 of 6 - permanent process - Mailing list pgsql-hackers
| From | Markus Wanner |
|---|---|
| Subject | Re: bg worker: patch 1 of 6 - permanent process |
| Date | |
| Msg-id | 4C780144.8080407@bluegap.ch |
| In response to | Re: bg worker: patch 1 of 6 - permanent process (Robert Haas <robertmhaas@gmail.com>) |
| Responses | Re: bg worker: patch 1 of 6 - permanent process |
| List | pgsql-hackers |
Hi,

On 08/26/2010 11:57 PM, Robert Haas wrote:
> It wouldn't require you to preallocate a big chunk of shared memory

Agreed, you wouldn't have to allocate it in advance. We would still want a
configurable upper limit. So this can be seen as another approach for an
implementation of a dynamic allocator. (Which should be separate from the
exact imessages implementation, if only for the sake of modularization, IMO.)

> In addition, it means that maximum_message_queue_size_per_backend (or
> whatever it's called) can be changed on-the-fly; that is, it can be
> PGC_SIGHUP rather than PGC_POSTMASTER.

That's certainly a point. However, as you are proposing a solution for just
one subsystem (i.e. imessages), I don't find it half as convincing.

If you are saying it *should* be possible to resize shared memory in a
portable way, why not do it for *all* subsystems right away? I still remember
Tom saying it's not something that's doable in a portable way. Why and how
should it be possible on a per-backend basis? How portable is mmap() really?
Why don't we use it in Postgres as of now?

I certainly think that these are orthogonal issues: whether to use fixed
boundaries or to dynamically allocate the available memory is one thing;
dynamic resizing is another. If the latter is possible, I'm certainly not
opposed to it. (But I would still favor dynamic allocation.)

> As to efficiency, the process is not much different once the initial
> setup is completed.

I fully agree with that. I'm more concerned about ease of use for developers.
Simply being able to alloc() from shared memory makes things easier than
having to invent a separate allocation method for every subsystem, again and
again (the argument that developers are used to this from multi-threaded
programming).

> Doing the extra setup just to send one or two messages
> might suck. But maybe that just means this isn't the right mechanism
> for those cases (e.g. the existing XID-wraparound logic should still
> use signal multiplexing rather than this system). I see the value of
> this as being primarily for streaming big chunks of data, not so much
> for sending individual, very short messages.

I agree that simple signals don't need a full imessage. But as soon as you
want to send some data (like which database to vacuum), or require the
delivery guarantee (i.e. no single message gets lost, as opposed to signals),
then imessages should be cheap enough.

>> The current approach uses plain spinlocks, which are more efficient. Note
>> that both, appending as well as removing from the queue are writing
>> operations, from the point of view of the queue. So I don't think LWLocks
>> buy you anything here, either.
>
> I agree that this might not be useful. We don't really have all the
> message types defined yet, though, so it's hard to say.

What does the type of lock used have to do with message types? IMO it doesn't
matter what kind of message or what size you want to send. For appending or
removing a pointer to or from a message queue, a spinlock seems to be just the
right thing to use.
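To illustrate that point, here's a rough sketch of such a spinlock-protected
pointer queue. This is not the actual imessages code; the type and function
names are made up for illustration, and it simply assumes that messages live
in shared memory and carry an embedded next pointer:

```c
/*
 * Rough sketch only -- not the actual imessages code.  Assumes the queue
 * and the messages themselves live in shared memory.
 */
#include "postgres.h"
#include "storage/spin.h"

typedef struct IMessage
{
    struct IMessage *next;      /* singly-linked list within the queue */
    int              type;      /* hypothetical message type */
    Size             size;      /* payload size, currently capped at ~8 KB */
    /* payload follows the header */
} IMessage;

typedef struct IMessageQueue
{
    slock_t   lock;             /* protects head and tail only */
    IMessage *head;             /* next message to be consumed */
    IMessage *tail;             /* most recently appended message */
} IMessageQueue;

/* Append a fully written message; only the pointer juggling is locked. */
static void
imsg_enqueue(IMessageQueue *queue, IMessage *msg)
{
    msg->next = NULL;
    SpinLockAcquire(&queue->lock);
    if (queue->tail)
        queue->tail->next = msg;
    else
        queue->head = msg;
    queue->tail = msg;
    SpinLockRelease(&queue->lock);
    /* ... signal the recipient here ... */
}

/* Detach the first message, or return NULL if the queue is empty. */
static IMessage *
imsg_dequeue(IMessageQueue *queue)
{
    IMessage *msg;

    SpinLockAcquire(&queue->lock);
    msg = queue->head;
    if (msg)
    {
        queue->head = msg->next;
        if (queue->head == NULL)
            queue->tail = NULL;
    }
    SpinLockRelease(&queue->lock);
    return msg;
}
```

Both operations only touch the head and tail pointers while holding the lock;
the message body itself is written and read entirely outside of it, which is
why a plain spinlock seems sufficient regardless of message type or size.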
>> I understand the need to limit the amount of data in flight, but I don't
>> think that sending any type of message should ever block. Messages are
>> atomic in that regard. Either they are ready to be delivered (in entirety)
>> or not. Thus the sender needs to hold back the message, if the recipient is
>> overloaded. (Also note that currently imessages are bound to a maximum size
>> of around 8 KB.)
>
> That's functionally equivalent to blocking, isn't it? I think that's
> just a question of what API you want to expose.

Hm.. well, yeah, that depends on what level you are arguing. The imessages API
can be used in a completely non-blocking fashion, so a process can
theoretically do other work while waiting for messages.

For parallel querying, the helper/worker backends would probably need to block
if the origin backend is not ready to accept more data, yes. However, making
them accept and process another job in the meantime seems hard to do. But
that's not an imessages problem per se. (While with the streaming layer I've
mentioned above, that would not be possible, because that blocks.)

> For replication, that might be the case, but for parallel query,
> per-queue seems about right. At any rate, no design we've discussed
> will let individual queues grow without bound.

Extend parallel querying to multiple nodes and you are back at the same
requirement. However, it's certainly something that can be done atop
imessages. I'm unsure whether doing it as part of imessages is a good thing or
not; given the above requirement, I don't currently think so. Using multiple
queues with different priorities, as you proposed, would probably make it more
feasible.

> You probably need this, but 8KB seems like a pretty small chunk size.

For node-internal messaging, I probably agree. That would need benchmarking,
as it's a compromise between latency and overhead, IMO.

I've chosen 8 KB so that these messages (together with some GCS and other
transport headers) presumably fit into ethernet jumbo frames. I'd argue that
you'd want even smaller chunk sizes for 1500-byte MTUs, because I don't expect
the GCS to do a better job at fragmenting than we can do in the upper layer
(i.e. without copying data and without additional latency when reassembling
the packet). But again, maybe that should be benchmarked first.

> I think one of the advantages of a per-backend area is that you don't
> need to worry so much about fragmentation. If you only need in-order
> message delivery, you can just use the whole thing as a big ring
> buffer.

Hm.. interesting idea. It's similar to my initial implementation, except that
I had only a single ring-buffer for all backends.

> There's no padding or sophisticated allocation needed. You
> just need a pointer to the last byte read (P1), the last byte allowed
> to be read (P2), and the last byte allocated (P3). Writers take a
> spinlock, advance P3, release the spinlock, write the message, take
> the spinlock, advance P2, release the spinlock, and signal the reader.

That would block parallel writers (i.e. only one process can write to the
queue at any time).

> Readers take the spinlock, read P1 and P2, release the spinlock, read
> the data, take the spinlock, advance P1, and release the spinlock.

It would also require copying data in case a process only needs to forward the
message. That's a quick pointer dequeue and enqueue exercise ATM.
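For comparison, here's a minimal sketch of the P1/P2/P3 protocol as I read
your description. The names are invented; for brevity it assumes the offsets
never wrap around the end of the buffer, omits any "buffer full" handling, and
supports exactly one writer and one reader at a time, which is precisely the
limitation I'm worried about:

```c
/*
 * Minimal sketch of the P1/P2/P3 ring-buffer protocol described above.
 * Names are invented; wrap-around and "buffer full" handling are omitted.
 */
#include "postgres.h"
#include "storage/spin.h"

typedef struct RingBuffer
{
    slock_t     lock;
    Size        p1;             /* last byte read */
    Size        p2;             /* last byte allowed to be read */
    Size        p3;             /* last byte allocated */
    char        data[8192];     /* hypothetical buffer size */
} RingBuffer;

/* Writer: reserve space, copy the message in, then make it readable. */
static void
ring_write(RingBuffer *rb, const char *msg, Size len)
{
    Size        start;

    SpinLockAcquire(&rb->lock);
    start = rb->p3;
    rb->p3 += len;              /* advance P3: space is now reserved */
    SpinLockRelease(&rb->lock);

    memcpy(rb->data + start, msg, len);

    SpinLockAcquire(&rb->lock);
    rb->p2 += len;              /* advance P2: message may now be read */
    SpinLockRelease(&rb->lock);
    /* ... signal the reader here ... */
}

/* Reader: copy out whatever is readable, then mark it as consumed. */
static Size
ring_read(RingBuffer *rb, char *dst)
{
    Size        p1;
    Size        p2;

    SpinLockAcquire(&rb->lock);
    p1 = rb->p1;
    p2 = rb->p2;
    SpinLockRelease(&rb->lock);

    memcpy(dst, rb->data + p1, p2 - p1);

    SpinLockAcquire(&rb->lock);
    rb->p1 = p2;                /* advance P1: bytes consumed */
    SpinLockRelease(&rb->lock);

    return p2 - p1;
}
```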
> You might still want to fragment chunks of data to avoid problems if,
> say, two writers are streaming data to a single reader. In that case,
> if the messages were too large compared to the amount of buffer space
> available, you might get poor utilization, or even starvation. But I
> would think you wouldn't need to worry about that until the message
> size got fairly high.

Some of the writers in Postgres-R allocate the chunk for the message in shared
memory way before they send the message. I.e. during a write operation of a
transaction that needs to be replicated, the backend allocates space for a
message at the start of the operation, but only fills it with change set data
during processing. That can possibly take quite a while. Decoupling memory
allocation from message queue management makes this possible without having to
copy the data. The same holds true for forwarding a message.

> Well, what I was thinking about is the fact that data messages are
> bigger. If I'm writing a 16-byte message once a minute and the reader
> and I block each other until the message is fully read or written,
> it's not really that big of a deal. If the same thing happens when
> we're trying to continuously stream tuple data from one process to
> another, it halves the throughput; we expect both processes to be
> reading/writing almost constantly.

Agreed. Unlike the proposed ring-buffer approach, the separate-allocator
approach doesn't have that problem, because writing itself is fully
parallelized, even to the same recipient.

> I think unicast messaging is really useful and I really want it, but
> the requirement that it be done through dynamic shared memory
> allocations feels very uncomfortable to me (as you've no doubt
> gathered).

Well, I on the other hand am utterly uncomfortable with having a separate
solution for memory allocation in each subsystem (and it definitely is an
inherent problem for lots of our subsystems). Given the ubiquity of dynamic
memory allocators, I don't really understand your discomfort.

Thanks for discussing; I always enjoy respectful disagreement.

Regards

Markus Wanner