Re: [HACKERS] [POC] Faster processing at Gather node - Mailing list pgsql-hackers
From: Andres Freund
Subject: Re: [HACKERS] [POC] Faster processing at Gather node
Date: 2018-02-07 18:41:37
Msg-id: 20180207184137.5mv4bipj3t7hbncv@alap3.anarazel.de
In response to: Re: [HACKERS] [POC] Faster processing at Gather node (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: [HACKERS] [POC] Faster processing at Gather node
List: pgsql-hackers
Hi,

On 2018-01-25 12:09:23 -0500, Robert Haas wrote:
> > Perhaps a short benchmark for 32bit systems using shm_mq wouldn't hurt?
> > I suspect there won't be much of a performance impact, but it's probably
> > worth checking.
>
> I don't think I understand your concern here.  If this is used on a
> system where we're emulating atomics and barriers in painful ways, it
> might hurt performance, but I think we have a policy of not caring.

Well, it's more than just systems like that - for 64bit atomics we
sometimes fall back to spinlock-based atomics on 32bit systems, even if
they support 32bit atomics.

> Also, I don't know where I'd find a 32-bit system to test.

You can compile with -m32 on reasonable systems ;)

> >> 		Assert(used <= ringsize);
> >> 		available = Min(ringsize - used, nbytes - sent);
> >>
> >> 		/* Bail out if the queue has been detached. */
> >> -		if (detached)
> >> +		if (mq->mq_detached)
> >
> > Hm, do all paths here guarantee that mq->mq_detached won't be stored on
> > the stack / register in the first iteration, and then never reread
> > further? I think it's fine because every branch of the if below ends up
> > in a syscall / barrier, but it might be worth noting that here.
>
> Aargh.  I hate compilers.  I added a comment.  Should I think about
> changing mq_detached to pg_atomic_uint32 instead?

I think a pg_compiler_barrier() would suffice to alleviate my concern,
right? If you wanted to go for an atomic, using pg_atomic_flag would
probably be more appropriate than pg_atomic_uint32.

> >> -		/* Write as much data as we can via a single memcpy(). */
> >> +		/*
> >> +		 * Write as much data as we can via a single memcpy(). Make sure
> >> +		 * these writes happen after the read of mq_bytes_read, above.
> >> +		 * This barrier pairs with the one in shm_mq_inc_bytes_read.
> >> +		 */
> >
> > s/above/above. Otherwise a newer mq_bytes_read could become visible
> > before the corresponding reads have fully finished./?
>
> I don't find that very clear.
> A newer mq_bytes_read could become
> visible whenever, and a barrier doesn't prevent that from happening.

Well, my point was that the barrier prevents the write to mq_bytes_read
becoming visible before the corresponding reads have finished. Which
then would mean the memcpy() could overwrite them. And a barrier *does*
prevent that from happening. I don't think this is the same as:

> What it does is ensure (together with the one in
> shm_mq_inc_bytes_read) that we don't try to read bytes that aren't
> fully *written* yet.

which seems much more about the barrier in shm_mq_inc_bytes_written()?

> Generally, my mental model is that barriers make things happen in
> program order rather than some other order that the CPU thinks would
> be fun.  Imagine a single-core server doing all of this stuff the "old
> school" way.  If the reader puts data into the queue before
> advertising its presence and the writer finishes using the data from
> the queue before advertising its consumption, then everything works.
> If you do anything else, it's flat busted, even on that single-core
> system, because a context switch could happen at any time, and then
> you might read data that isn't written yet.  The barrier just ensures
> that we get that order of execution even on fancy modern hardware, but
> I'm not sure how much of that we really need to explain here.

IDK, I find it nontrivial to understand individual uses of barriers.
There are often multiple non-equivalent ways to use barriers, and the
logic why a specific one is correct isn't always obvious.

> >> +			pg_memory_barrier();
> >> 			memcpy(&mq->mq_ring[mq->mq_ring_offset + offset],
> >> 				   (char *) data + sent, sendnow);
> >> 			sent += sendnow;
> >
> > Btw, this mq_ring_offset stuff seems a bit silly, why don't we use
> > proper padding/union in the struct to make it unnecessary to do that bit
> > of offset calculation every time? I think it currently prevents
> > efficient address calculation instructions from being used.
> Well, the root cause -- aside from me being a fallible human being
> with only limited programming skills -- is that I wanted the parallel
> query code to be able to request whatever queue size it preferred
> without having to worry about how many bytes of that space was going
> to get consumed by overhead.

What I meant is that instead of doing

	struct shm_mq
	{
		...
		bool		mq_detached;
		uint8		mq_ring_offset;
		char		mq_ring[FLEXIBLE_ARRAY_MEMBER];
	};

it'd be possible to do something like

	struct shm_mq
	{
		...
		bool		mq_detached;
		union
		{
			char		mq_ring[FLEXIBLE_ARRAY_MEMBER];
			double		forcealign;
		} d;
	};

which'd force the struct to be laid out so mq_ring is at a suitable
offset. We use that in a bunch of places. As far as I understand,
that'd not run counter to your goal of:

> without having to worry about how many bytes of that space was going
> to get consumed by overhead.

right?

> change it up, if somebody felt like working out how the API should be
> set up.  I don't really want to do that right now, though.

Right.

> >> From 666d33a363036a647dde83cb28b9d7ad0b31f76c Mon Sep 17 00:00:00 2001
> >> From: Robert Haas <rhaas@postgresql.org>
> >> Date: Sat, 4 Nov 2017 19:03:03 +0100
> >> Subject: [PATCH 2/2] shm-mq-reduce-receiver-latch-set-v1
> >
> >> -	/* Consume any zero-copy data from previous receive operation. */
> >> -	if (mqh->mqh_consume_pending > 0)
> >> +	/*
> >> +	 * If we've consumed an amount of data greater than 1/4th of the ring
> >> +	 * size, mark it consumed in shared memory.  We try to avoid doing this
> >> +	 * unnecessarily when only a small amount of data has been consumed,
> >> +	 * because SetLatch() is fairly expensive and we don't want to do it
> >> +	 * too often.
> >> +	 */
> >> +	if (mqh->mqh_consume_pending > mq->mq_ring_size / 4)
> >> 	{
> >
> > Hm. Why are we doing this at the level of updating the variables, rather
> > than at the SetLatch() calls?
>
> Hmm, I'm not sure I understand what you're suggesting, here.
> In general, we return with the data for the current message unconsumed,
> and then consume it the next time the function is called, so that
> (except when the message wraps the end of the buffer) we can return a
> pointer directly into the buffer rather than having to memcpy().  What
> this patch does is postpone consuming the data further, either until
> we can free up at least a quarter of the ring buffer or until we need
> to wait for more data.  It seemed worthwhile to free up space in the
> ring buffer occasionally even if we weren't to the point of waiting
> yet, so that the sender has an opportunity to write new data into that
> space if it happens to still be running.

What I'm trying to suggest is that instead of postponing the update of
mq_bytes_read (by storing the amount of already-performed reads in
mqh_consume_pending), we continue to update mq_bytes_read but only set
the latch if the above thresholds are crossed. That way a burst of
writes can fully fill the ringbuffer, but the cost of doing a
SetLatch() is amortized. In my testing SetLatch() was the expensive
part, not the necessary barriers in shm_mq_inc_bytes_read().

- Andres