
From: Andres Freund
Subject: Re: [HACKERS] [POC] Faster processing at Gather node
Date: 2018-02-07
Msg-id: 20180207184137.5mv4bipj3t7hbncv@alap3.anarazel.de
In response to: Re: [HACKERS] [POC] Faster processing at Gather node (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers

Hi,

On 2018-01-25 12:09:23 -0500, Robert Haas wrote:
> > Perhaps a short benchmark for 32bit systems using shm_mq wouldn't hurt?
> > I suspect there won't be much of a performance impact, but it's probably
> > worth checking.
>
> I don't think I understand your concern here.  If this is used on a
> system where we're emulating atomics and barriers in painful ways, it
> might hurt performance, but I think we have a policy of not caring.

Well, it's more than just systems like that - for 64-bit atomics we
sometimes do fall back to spinlock-based atomics on 32-bit systems, even
if they support 32-bit atomics natively.
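
To make that concern concrete, here's a conceptual sketch (made-up type
and function names, not the actual fallback code under
src/include/port/atomics/) of what a spinlock-emulated 64-bit "atomic"
read boils down to - a lot more than a single load:

#include "postgres.h"
#include "storage/spin.h"

/*
 * Conceptual sketch only.  Where no native 64-bit atomic operations
 * exist, an "atomic" read degrades to a lock-protected plain load.
 */
typedef struct emulated_u64
{
    slock_t     lock;       /* assume SpinLockInit() was done elsewhere */
    uint64      value;
} emulated_u64;

static uint64
emulated_read_u64(emulated_u64 *v)
{
    uint64      result;

    SpinLockAcquire(&v->lock);
    result = v->value;
    SpinLockRelease(&v->lock);

    return result;
}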


> Also, I don't know where I'd find a 32-bit system to test.

You can compile with -m32 on reasonable systems ;)


> >>               Assert(used <= ringsize);
> >>               available = Min(ringsize - used, nbytes - sent);
> >>
> >>               /* Bail out if the queue has been detached. */
> >> -             if (detached)
> >> +             if (mq->mq_detached)
> >
> > Hm, do all paths here guarantee that mq->mq_detached won't be stored on
> > the stack / register in the first iteration, and then not reread any
> > further? I think it's fine because every branch of the if below ends up
> > in a syscall / barrier, but it might be worth a comment on that here.
>
> Aargh.  I hate compilers.  I added a comment.  Should I think about
> changing mq_detached to pg_atomic_uint32 instead?

I think a pg_compiler_barrier() would suffice to alleviate my concern,
right? If you wanted to go for an atomic, using pg_atomic_flag would
probably be more appropriate than pg_atomic_uint32.
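
To illustrate what I mean - a minimal sketch, assuming we're inside
shm_mq.c with the struct members visible and port/atomics.h included;
the helper name is made up:

/*
 * Hypothetical helper, only to show the barrier placement.  Without the
 * pg_compiler_barrier() the compiler may load mq->mq_detached once, keep
 * it in a register, and never reread it, since nothing in the loop body
 * visibly modifies it.  The barrier forces a fresh load each iteration
 * and compiles to no instructions.
 */
static void
wait_until_detached(shm_mq *mq)
{
    while (!mq->mq_detached)
    {
        /* ... do one unit of real work here, or WaitLatch() ... */

        pg_compiler_barrier();
    }
}

Going the pg_atomic_flag route would instead funnel the reads through
the pg_atomic_*_flag() API, which also rules out that kind of caching,
but seems like more churn than needed here.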


> >> -                     /* Write as much data as we can via a single memcpy(). */
> >> +                     /*
> >> +                      * Write as much data as we can via a single memcpy(). Make sure
> >> +                      * these writes happen after the read of mq_bytes_read, above.
> >> +                      * This barrier pairs with the one in shm_mq_inc_bytes_read.
> >> +                      */
> >
> > s/above/above. Otherwise a newer mq_bytes_read could become visible
> > before the corresponding reads have fully finished./?
>
> I don't find that very clear.  A newer mq_bytes_read could become
> visible whenever, and a barrier doesn't prevent that from happening.

Well, my point was that the barrier prevents the write to mq_bytes_read
from becoming visible before the corresponding reads have finished;
otherwise the memcpy() could overwrite bytes that haven't actually been
read yet. And a barrier *does* prevent that from happening.

I don't think this is the same as:

> What it does is ensure (together with the one in
> shm_mq_inc_bytes_read) that we don't try to read bytes that aren't
> fully *written* yet.

which seems much more about the barrier in shm_mq_inc_bytes_written()?
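
To make the ordering I'm arguing about concrete, here's a toy sketch
(made-up names and a fixed-size ring, not the actual shm_mq code) of the
two sides of the pairing:

#include "postgres.h"
#include "port/atomics.h"

/* Toy single-producer / single-consumer ring, purely for illustration. */
typedef struct toy_ring
{
    uint64      bytes_written;  /* advanced only by the writer */
    uint64      bytes_read;     /* advanced only by the reader */
    char        ring[8192];
} toy_ring;

/* Reader: copy n bytes out of the ring, then advertise the freed space. */
static void
toy_consume(toy_ring *rb, char *dest, Size off, Size n)
{
    memcpy(dest, rb->ring + off, n);            /* (R1) read payload */

    /*
     * Pairs with the barrier in toy_produce(): R1 has to be finished
     * before the new bytes_read (R2) can become visible, or the writer
     * could overwrite bytes we haven't copied out yet.
     */
    pg_memory_barrier();

    rb->bytes_read += n;                        /* (R2) advertise space */
}

/* Writer: check free space, then fill it. */
static void
toy_produce(toy_ring *rb, const char *src, Size off, Size n)
{
    uint64      used = rb->bytes_written - rb->bytes_read;     /* (W1) */

    Assert(sizeof(rb->ring) - used >= n);

    /*
     * Pairs with the barrier in toy_consume(): the read of bytes_read
     * (W1) has to happen before we scribble on the ring (W2), so we only
     * ever write into space the reader is actually done with.
     */
    pg_memory_barrier();

    memcpy(rb->ring + off, src, n);             /* (W2) fill freed space */

    /*
     * And the shm_mq_inc_bytes_written() concern is the mirror image: a
     * write barrier before advancing bytes_written, so the reader never
     * sees a count covering bytes that aren't fully written yet.
     */
    pg_write_barrier();
    rb->bytes_written += n;
}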


> Generally, my mental model is that barriers make things happen in
> program order rather than some other order that the CPU thinks would
> be fun.  Imagine a single-core server doing all of this stuff the "old
> school" way.  If the reader puts data into the queue before
> advertising its presence and the writer finishes using the data from
> the queue before advertising its consumption, then everything works.
> If you do anything else, it's flat busted, even on that single-core
> system, because a context switch could happen at any time, and then
> you might read data that isn't written yet.  The barrier just ensures
> that we get that order of execution even on fancy modern hardware, but
> I'm not sure how much of that we really need to explain here.

IDK, I find it nontrivial to understand individual uses of
barriers. There are often multiple, non-equivalent ways to use barriers,
and the logic for why a specific one is correct isn't always obvious.


> >> +                     pg_memory_barrier();
> >>                       memcpy(&mq->mq_ring[mq->mq_ring_offset + offset],
> >>                                  (char *) data + sent, sendnow);
> >>                       sent += sendnow;
> >
> > Btw, this mq_ring_offset stuff seems a bit silly, why don't we use
> > proper padding/union in the struct to make it unnecessary to do that bit
> > of offset calculation every time? I think it currently prevents
> > efficient address calculation instructions from being used.
>
> Well, the root cause -- aside from me being a fallible human being
> with only limited programming skills -- is that I wanted the parallel
> query code to be able to request whatever queue size it preferred
> without having to worry about how many bytes of that space was going
> to get consumed by overhead.

What I meant is that instead of doing
struct shm_mq
{
    ...
    bool        mq_detached;
    uint8        mq_ring_offset;
    char        mq_ring[FLEXIBLE_ARRAY_MEMBER];
};

it'd be possible to do something like

struct shm_mq
{
    ...
    bool        mq_detached;
    union {
        char        mq_ring[FLEXIBLE_ARRAY_MEMBER];
        double        forcealign;
    } d;
};

which'd force the struct to be laid out so mq_ring is at a suitable
offset. We use that in a bunch of places.

As far as I understand that'd not run counter to your goals of:
> without having to worry about how many bytes of that space was going
> to get consumed by overhead.

right?
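
Re the "efficient address calculation" bit above, here's what I mean in
isolation (plain C, made-up struct names, nothing to do with the actual
patch):

#include <stdint.h>
#include <string.h>

/* As today: where the ring starts is a runtime value stored in the struct. */
struct mq_with_offset
{
    uint8_t     ring_offset;
    char        ring[];
};

static void
copy_in_with_offset(struct mq_with_offset *mq, size_t off,
                    const char *src, size_t n)
{
    /* has to load ring_offset and add it in before forming the address */
    memcpy(&mq->ring[mq->ring_offset + off], src, n);
}

/* With the union/padding trick: the ring starts at a fixed, known offset. */
struct mq_fixed
{
    uint64_t    header_placeholder;     /* stands in for the real fields */
    char        ring[];
};

static void
copy_in_fixed(struct mq_fixed *mq, size_t off, const char *src, size_t n)
{
    /* constant displacement, folds straight into the addressing mode */
    memcpy(&mq->ring[off], src, n);
}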


> change it up, if somebody felt like working out how the API should be
> set up.  I don't really want to do that right now, though.

Right.


> >> From 666d33a363036a647dde83cb28b9d7ad0b31f76c Mon Sep 17 00:00:00 2001
> >> From: Robert Haas <rhaas@postgresql.org>
> >> Date: Sat, 4 Nov 2017 19:03:03 +0100
> >> Subject: [PATCH 2/2] shm-mq-reduce-receiver-latch-set-v1
> >
> >> -     /* Consume any zero-copy data from previous receive operation. */
> >> -     if (mqh->mqh_consume_pending > 0)
> >> +     /*
> >> +      * If we've consumed an amount of data greater than 1/4th of the ring
> >> +      * size, mark it consumed in shared memory.  We try to avoid doing this
> >> +      * unnecessarily when only a small amount of data has been consumed,
> >> +      * because SetLatch() is fairly expensive and we don't want to do it
> >> +      * too often.
> >> +      */
> >> +     if (mqh->mqh_consume_pending > mq->mq_ring_size / 4)
> >>       {
> >
> > Hm. Why are we doing this at the level of updating the variables, rather
> > than SetLatch calls?
>
> Hmm, I'm not sure I understand what you're suggesting, here.  In
> general, we return with the data for the current message unconsumed,
> and then consume it the next time the function is called, so that
> (except when the message wraps the end of the buffer) we can return a
> pointer directly into the buffer rather than having to memcpy().  What
> this patch does is postpone consuming the data further, either until
> we can free up at least a quarter of the ring buffer or until we need
> to wait for more data. It seemed worthwhile to free up space in the
> ring buffer occasionally even if we weren't to the point of waiting
> yet, so that the sender has an opportunity to write new data into that
> space if it happens to still be running.

What I'm trying to suggest is that instead of postponing the update of
mq_bytes_read (by storing the amount of already-performed reads in
mqh_consume_pending), we continue to update mq_bytes_read but only set
the latch if the threshold above is crossed. That way a burst of writes
can fully fill the ring buffer, but the cost of doing a SetLatch() is
amortized. In my testing SetLatch() was the expensive part, not the
necessary barriers in shm_mq_inc_bytes_read().
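
I.e. something along these lines (hypothetical field and function names,
assuming the context of shm_mq.c - just to show the shape, not a patch):

/*
 * Hypothetical sketch: publish consumed bytes to shared memory right
 * away, so a burst of writes can always refill the ring, but amortize
 * the expensive SetLatch() by only waking the sender once a quarter of
 * the ring has been freed since the last wakeup (the receiver would
 * still force a wakeup before going to sleep itself).
 */
static void
advance_bytes_read(shm_mq *mq, shm_mq_handle *mqh, Size consumed)
{
    /* keep the shared counter current; barrier details elided here */
    mq->mq_bytes_read += consumed;

    mqh->mqh_unwoken_bytes += consumed;     /* made-up tracking field */
    if (mqh->mqh_unwoken_bytes >= mq->mq_ring_size / 4)
    {
        SetLatch(&mq->mq_sender->procLatch);
        mqh->mqh_unwoken_bytes = 0;
    }
}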

- Andres

